1. Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a more concrete Machine Learning or AI problem definition and a preliminary plan designed to achieve the objectives.
1.1. Business Goals
The first objective is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. Your goal is to uncover important factors that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.
1.1.1. 📖 Background Report
Collate the information that is known about the organization's business situation at the start of the project. These details not only serve to more closely identify the business goals that will be solved, but also serve to identify resources, both human and material, that may be used or needed during the course of the project.
❓ Organization
- Develop organizational charts identifying divisions, departments and project groups. The chart should also identify managers' names and responsibilities.
- Identify key persons in the business and their roles.
- Identify an internal sponsor (financial sponsor and primary user/domain expert).
- Identify the business units which are impacted by the data mining project (e.g., Marketing, Sales, Finance).
- Identify any other key stakeholders (e.g., customers, suppliers, regulatory agencies).
❓ Problem area
- Identify the problem area (e.g., Marketing, Customer Care, Business Development, etc.).
- Describe the problem in general terms.
- Check the current status of the project (e.g., is it already clear within the business unit that we are performing an ML/AI project, or do we need to advertise ML/AI as a key technology in the business?).
- Clarify prerequisites of the project (e.g., what is the motivation of the project? Does the business already use machine learning?).
- If necessary, prepare presentations and present ML/AI to the business.
- Identify target groups for the project result (e.g., Do we expect a written report for top management or do we expect a running system that is used by naive end users?).
- Identify the users' needs and expectations.
❓ Current solution
- Describe any solution currently in use for the problem.
- Describe the advantages and disadvantages of the current solution and the level to which it is accepted by the users.
1.1.2. 📖 Business objectives
Describe the customer's primary objective, from a business perspective, in the data mining project. In addition to the primary business objective, there are typically a large number of related business questions that the customer would like to address. For example, the primary business goal might be to keep current customers by predicting when they are prone to move to a competitor, while secondary business objectives might be to determine whether lower fees affect only one particular segment of customers.
- Informally describe the problem which is supposed to be solved with machine learning.
- Specify all business questions as precisely as possible.
- Specify any other business requirements (e.g., the business does not want to lose any customers).
- Specify expected benefits in business terms.
1.1.3. 📖 Business success criteria
Describe the criteria for a successful or useful outcome to the project from the business point of view. These criteria might be quite specific and objectively measurable, such as reducing customer churn to a certain level, or general and subjective, such as "give useful insights into the relationships." In the latter case, indicate who makes the subjective judgment.
- Specify business success criteria (e.g., improve the response rate of a mailing campaign by 10 percent and the sign-up rate by 20 percent).
- Specify the measurable benefits ML is expected to bring to the organization (e.g., increase revenue by X%, increase the number of units sold by Y%, number of trees saved).
- Specify what the users want to achieve by using ML (e.g., for recommendation systems this could involve helping users find content they will enjoy).
- Specify measures that correlate with future success from the business' perspective. This could include the users' affective states when using the ML-enabled system (e.g., customer sentiment and engagement).
- Identify who assesses the success criteria.
Beware of setting unattainable goals; make them as realistic as possible. Customers often see ML as magic ("ML will solve everything"). Let stakeholders know the limitations and manage their expectations.
Each of the success criteria should relate to at least one of the specified business objectives.
1.2. Assess situation
This task involves more detailed fact-finding about all the resources, constraints, assumptions and other factors that should be considered in determining the project goal and plan. In the previous task, your objective was to get quickly to the crux of the situation. Here, you want to flesh out the details.
1.2.1. 📖 Inventory of resources
List the resources available to the project, including: personnel (business experts, data experts, technical support, data mining personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (data mining tools, other relevant software).
❓ Sources of data and knowledge
- Identify the data sources.
- Identify the type of data sources (e.g., on-line sources, experts, written documentation, etc.).
- Identify the knowledge sources.
- Identify the type of knowledge sources (e.g., on-line sources, experts, written documentation, etc.).
- Check available tools and techniques.
- Describe the relevant background knowledge (informally or formally).
❓ Personnel sources
- Identify system administrator, database administrator and technical support staff for further questions.
- Identify market analysts, data mining experts and statisticians and check their availability.
- Check availability of domain experts for later phases.
Remember that the project may need technical staff at odd times throughout the project, for example during Data Transformation.
1.2.2. 📖 Requirements Report
List all requirements of the project including schedule of completion, comprehensibility and quality of results and security as well as legal issues. As part of this output, make sure that you are allowed to use the data.
- Identify the requirements on scheduling (e.g., when do we need to deliver the first results? When do we need to deliver the final results?).
- Identify the requirements on comprehensibility, accuracy, deployability, maintainability and repeatability of the ML/AI project and the resulting model(s).
- Identify the requirements on security, legal restrictions, privacy, reporting and project schedule.
❓ Infrastructure Requirements
- Specify what data streaming strategy will be used (e.g., real-time data transportation or in batches).
- Specify how the model of the ML-enabled system will be executed and consumed (e.g., client-side, back-end, cloud-based, web service end-point).
- Specify the need for ML-enabled system abilities to continuously learn from new data, extending the existing model's knowledge.
- Specify the integration that the model will have with the rest of the system functionality.
- Specify how the system deals with risks to prevent dangerous failures.
- Analyze the probability of the occurrence of harm and its severity for critical systems that incorporate ML.
- Specify how the system deals with security issues (e.g., vulnerabilities) to protect the data. (ML systems often contain sensitive data that should be protected.)
- Specify where the ML artifacts (e.g., models, data, scripts) will be stored.
- Specify what ML-enabled system data needs to be collected (Telemetry involves collecting data such as clicks on particular buttons and could involve other usage and performance monitoring data).
- Identify the infrastructure requirements (e.g., hardware, software, data, etc.).
- Identify key interaction methods.
- Identify any security constraints or requirements (e.g., Is the system handling sensitive data?).
❓ Performance and Efficiency
- Identify any requirements related to performance – the ability of the system to perform actions within defined time and throughput bounds.
- Identify any hardware limitations that may require extra efficiency of the system.
- Identify the need for scalability – the ability to increase or decrease the capacity of the system in response to changing demands.
- Specify the acceptable time to execute the model and return the predictions.
- Specify the acceptable time to train the model if any.
❓ User Experience Perspective
- User Expectations – Specify expectations of customers and end-user in terms of how the system should behave (e.g., how often they expect predictions to be right or wrong).
- Accountability – Specify who is responsible for unexpected model results or actions taken based on unexpected model results.
- Cost Analysis – Evaluate execution costs and impact of incorrect predictions.
- User Guidance – Specify how strongly the system forces the user to do what the model indicates they should (e.g., automatic or assisted actions).
- Interaction Frequency – Specify how often the system interacts with users (e.g., interact whenever the user asks for it or whenever the system thinks the user will respond).
- User Involvement – Specify what interactions the users will have with the ML-enabled system, (e.g., to provide new data for learning, or human-in-the-loop systems where models require human interaction).
- Visualization – Specify methods for presenting ML outcomes comprehensibly.
❓ Bias, ethics, fairness
- Bias Mitigation: Outline strategies to prevent systematic prejudices in results.
- Ethical Considerations: Define requirements for ensuring moral behavior in the system.
- Fairness Metrics: Establish measures to ensure unbiased operation.
- Unbiased Dataset Selection: Define criteria for choosing unbiased training data.
❓ Stability
- Stability: Define requirements for consistent performance under varying conditions.
- Robustness: Specify system resilience to unexpected inputs or environmental changes.
- Flexibility: Outline system adaptability to changing demands or conditions.
❓ Privacy
- Specify requirements to prevent individual data identification.
❓ Legal and Regulatory Requirements
- Specify legal constraints on data use and storage.
- Compliance: Identify and document all relevant legal and regulatory requirements.
- Data Protection Laws: Specify adherence to regulations like GDPR.
- Industry-Specific Regulations: Outline compliance with sector-specific legal frameworks.
❓ Quantitative Targets
- Identify what quantitative metrics are required (e.g., the number of correctly predicted data points out of all data points).
- Determine the business impact of a false positive and of a false negative, and how the model should be tuned to maximize business results.
The dominant concerns … pertain to decision-making with the customers. In the conventional setting, this activity involved requirements analysis and specification in the initial phase and an acceptance inspection in the final phase. This activity flow is not possible when working with ML-based systems due to the impossibility of prior estimation or assurance of achievable accuracy.
❓ Explainability, Interpretability & Justifiability
- Specify the need to understand the reasons for the model's inferences. The model might need to be able to summarize the reasons for its decisions. Other related concerns, such as transparency and interpretability, may apply.
- Explainability: the extent to which the internal mechanics of an ML-enabled system can be explained in human terms.
- Interpretability: the extraction of relevant knowledge from a model concerning relationships either contained in the data or learned by the model.
- Justifiability: the ability to show that the output of an ML-enabled system is right or reasonable.
- Transparency: the extent to which a human user can infer why the system made a particular decision or produced a particular externally visible behaviour.
The motivation for versioning, provenance, logging and transparency is that a decision made by a model should be auditable. Auditing is a way of proving that a model is indeed fair, accountable, and transparent. A good criterion for a model being auditable is that it is possible to perform a root cause analysis for a given event, say an unexpected model decision. A root cause is a precursor event without which the event being investigated would not have occurred, or would not recur. Other causes may influence an event, but the root cause is the crucial first step in the chain without which the event cannot occur.
Explainability is twofold: on the one side, there is a need to explain the model (what has been learned); on the other side, there is a need to explain individual predictions of the model. One interviewed practitioner (P4) mentioned that explainability may be even more important than predictive power: "Often, we constrain the models to derive explanations. We look for models that partition the input attributes to show relations between input and output. This decreases the predictive power but is usually favored by our customers."
❓ Usability
- The extent to which a system can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.
❓ Safety
- The absence of failures or conditions that render a system dangerous.
1.2.3. 📖 Assumptions
List the assumptions made by the project. These may be assumptions about the data that can be checked during data mining, but may also include non-checkable assumptions about the business upon which the project rests. It is particularly important to list the latter if they form conditions on the validity of the results.
- Clarify all assumptions (including implicit ones) and make them explicit (e.g., To address the business question, a minimum number of customers with age above 50 is necessary).
- List assumptions on data quality (e.g., accuracy, availability).
- List assumptions on external factors (e.g., economic issues, competitive products, technical advances).
- Clarify assumptions that lead to any of the estimates (e.g., the price of a specific tool is assumed to be lower than $1000).
- List all assumptions on whether it is necessary to understand, describe or explain the model (e.g., how should the model and results be presented to senior management or the sponsor?).
1.2.4. 📖 Constraints
List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data that it is practical to use for modeling.
- Check general constraints (e.g., legal issues, budget, timescales and resources).
- Check access rights to data sources (e.g., access restrictions, password required).
- Check technical accessibility of data (operating systems, data management system, file or database format).
- Check whether relevant knowledge is accessible.
- Check budget constraints (Fixed costs, implementation costs, etc.).
1.2.5. 📖 Terminology
Compile a glossary of terminology relevant to the project. This may include two components:
- A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful “knowledge elicitation” and education exercise.
- A glossary of data mining terminology, illustrated with examples relevant to the business problem in question.
- Check prior availability of glossaries, otherwise begin to draft glossaries.
- Talk to domain experts to understand their terminology.
- Become familiar with the business terminology.
1.2.6. 📖 Costs and benefits
Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible (for example, using monetary measures in a commercial situation), as this enables a stronger business case to be made.
- Estimate costs for data collection.
- Estimate costs of developing and implementing a solution.
- Identify benefits when a solution is deployed (e.g. improved customer satisfaction, ROI and increase in revenue).
- Estimate operating costs.
Remember to identify hidden costs such as repeated data extraction and preparation, changes in work flows and training time during learning.
1.2.7. 📖 Risks and Contingencies
List the risks, that is, events that might occur and impact schedule, cost or results. List the corresponding contingency plans: what action will be taken to avoid or minimize the impact of, or to recover from, each foreseen risk.
- Identify business risks (e.g., competitor comes up with better results first).
- Identify organizational risks (e.g., department requesting project not having funding for project).
- Identify financial risks (e.g., further funding depends on initial data mining results).
- Identify technical risks.
- Identify risks that depend on data and data sources (e.g. poor quality and coverage).
- Determine conditions under which each risk may occur.
- Develop contingency plans.
1.3. Determine data mining goals
A business goal states objectives in business terminology. A data mining goal states project objectives in technical terms. For example, the business goal might be “Increase catalog sales to existing customers.” A data mining goal might be “Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.) and the price of the item.”
- Data mining goals: Describe the intended outputs of the project that enable the achievement of the business objectives.
- Data mining success criteria: Define the criteria for a successful outcome to the project in technical terms, for example a certain level of predictive accuracy or a propensity-to-purchase profile with a given degree of "lift." As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons making the subjective judgment should be identified. Specify the metrics and acceptable values the model should achieve (e.g., for classification problems this could involve accuracy ≥ X%, precision ≥ Y%, recall ≥ Z%). Specify the ML results in terms of the functionality the model will provide (e.g., classify customers, predict probabilities).
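A minimal sketch of turning such success criteria into an automated acceptance check (the metric names and thresholds below are illustrative placeholders, not values prescribed by this guide):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical thresholds agreed with the business (placeholders).
ACCEPTANCE_CRITERIA = {"accuracy": 0.90, "precision": 0.85, "recall": 0.80}

def meets_success_criteria(y_true, y_pred):
    """Check the agreed data mining success criteria on a held-out dataset."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
    passed = all(scores[name] >= threshold
                 for name, threshold in ACCEPTANCE_CRITERIA.items())
    return passed, scores
```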
1.4. Produce project plan
Describe the intended plan for achieving the data mining goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.
1.4.1. Maintainability
The ease with which a system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment.
1.4.2. Portability
The ability to transfer a system or element of a system from one environment to another.
1.4.3. Reliability
The probability of the software performing without failure for a specific number of uses or amount of time.
- Project plan: List the stages to be executed in the project, together with their duration, resources required, inputs, outputs and dependencies. Where possible, make explicit the large-scale iterations in the data mining process, for example repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between the time schedule and risks. Mark the results of these analyses explicitly in the project plan, ideally with actions and recommendations should the risks materialize. Note: the project plan contains detailed plans for each phase; for example, decide at this point which evaluation strategy will be used in the evaluation phase. The project plan is a dynamic document in the sense that at the end of each phase a review of progress and achievements is necessary, and a corresponding update of the project plan is recommended. Specific review points for these reviews are part of the project plan, too.
- Initial assessment of tools and techniques: At the end of the first phase, the project also performs an initial assessment of tools and techniques. Here, you select, for example, a data mining tool that supports various methods for the different stages of the process. It is important to assess tools and techniques early in the process, since this selection possibly influences the entire project.
- Compromises: In data-driven intelligent applications, the satisfaction of business goals is constrained by the limitations of technological solutions. Sometimes the business experts have to make compromises and accept a solution that falls short of expectations.
1.5. 👨🏫 Educating Stakeholders
1.5.1. 👨🏫 Focus on Augmenting People, Not Replacing Them
Big technological advances are often historically associated with a reduction in staff head count. While reducing labor costs is attractive to business executives, it is likely to create resistance from those whose jobs appear to be at risk. In pursuing this way of thinking, organizations can miss out on real opportunities to use the technology effectively. "We advise our clients that the most transformational benefits of AI in the near term will arise from using it to enable employees to pursue higher-value activities," added Mr. Andrews.
Gartner predicts that by 2020, 20 percent of organizations will dedicate workers to monitoring and guiding neural networks.
"Leave behind notions of vast teams of infinitely duplicable 'smart agents' able to execute tasks just like humans," said Mr. Andrews. "It will be far more productive to engage with workers on the front line. Get them excited and engaged with the idea that AI-powered decision support can enhance and elevate the work they do every day."
As you can notice, I use the term “augment” when referring to the job AI is to perform — that’s because AI’s primary task is to augment human work and support data-driven decision-making, not to replace humans in the workplace. Of course, there are businesses aiming at automating as much as can be automated, but generally speaking, it’s really not AI’s cup of tea. It’s much more into teamwork. What’s more, it has been found that AI and humans joining forces gives better results. In a Harvard Business Review article, authors H. James Wilson and Paul R. Daugherty write:
In our research involving 1,500 companies, we found that firms achieve the most significant performance improvements when humans and machines work together.
However, as a leader, your job in an AI project is to help your staff understand why you’re introducing artificial intelligence and how they should use the insights provided by the model. Without that, you just have fancy, but useless, analytics.
1.5.2. 👨🏫 Make data literacy an organization-wide priority, not just among people within the technology org.
Data literacy is not a technical skill. It is a professional skill. Encourage all of your employees — marketers, sales professionals, operations personnel, product managers, etc. — to develop their data literacy through quarterly engagement sessions that you host, where you cover topics like data-driven decision-making, the art of the possible in AI, how data connects to your business, ethics & AI, or how to communicate using data. This kind of organization-wide emphasis is the basis for a transformation to a data-first culture.
1.5.3. 👨🏫 Develop an internal common language for speaking about data, how it intersects with your business and industry, and how it is changing specific roles at your company.
The world of data is big, filled with buzzwords and misunderstanding. Develop a view as an organization which components of data literacy matter most to your organization — if you are a financial services firm, it may be probability and risk measurement; if you are a technology firm, it may be experimentation and visualization. In your L&D sessions, develop learning content that uses this language and demonstrates how it connects to your business in multiple departments, so employees can connect all the dots between data literacy and their workflows.
1.5.4. 👨🏫 No dataset pitfall
Machine learning is not about solving some random problem that looks commercially appealing. It is all about finding a problem for which a good training dataset can be acquired.
For example, what is harder: speech recognition or OCR? Both tasks look similar at first glance; both attempt to recognize a sequence of words. The only difference is that the first targets an audio stream, while the second targets a stream of glyphs. And yet speech recognition is much harder.
The trouble is that it is difficult to build a good dataset for speech recognition. The only way to do this is to hire an army of native speakers and ask them to manually transcribe the text in a collection of audio files. This process is extremely money- and time-consuming. It can take half a year to build a first dataset that is barely sufficient for training, and much longer to develop it into a high-quality, representative dataset that includes corner cases such as rare accents.
This problem is not so severe with OCR. A training dataset for OCR can be constructed automatically without relying on manual labor. The idea is to render sample text into an image file and then distort it so that it resembles real-world scanned text: rotate it, blur it, add noise, apply a brightness gradient, etc. Repeating this process with different texts and fonts allows creating a dataset of decent quality in a fully automatic mode. Of course, some actually scanned texts are required to make the dataset representative, but this is not a bottleneck for ML engineers and can be done later.
Machine learning problems can be roughly classified into the following tiers depending on the source of the dataset: naturally existing datasets; programmatically constructed datasets; manually created datasets.
Tasks with naturally existing datasets form the lowest tier. For instance, the problem of detecting the sex and age of a person by analyzing a photo is exactly of this type: a dataset can be constructed by scraping social network profiles. Another problem of this type is predicting the next word while a user is typing a text message, provided that an archive of previously sent messages exists. Similar problems are predicting the user's geolocation, stock price forecasting, and recommending movies. Even when data already exists, problems in this category can pose a significant challenge because the data can be dirty or biased. People do not always specify their true sex and age in profiles, and recommender systems are prone to bias due to the filter-bubble effect. Still, naturally existing dirty data is better than no data at all.
The second tier is formed by problems which do not have natural datasets, but for which datasets can be constructed automatically. One such problem, OCR, was described above. A large number of other problems focused on image enhancement can be handled in a similar manner, including image colorization, deblurring and upscaling. The dataset construction process consists of downloading a collection of originally good images and then artificially degrading them, e.g., converting them to black and white. The goal of the ML model is to reconstruct the original image. Datasets for other problems, such as audio denoising, can be constructed by mixing multiple samples.
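A minimal sketch of this kind of tier-2 dataset construction, assuming Pillow is available (the specific degradations are illustrative):

```python
from pathlib import Path
from PIL import Image, ImageFilter

def make_training_pair(src: Path, out_dir: Path) -> None:
    """Turn one 'good' image into a (degraded input, original target) pair."""
    original = Image.open(src).convert("RGB")
    degraded = original.convert("L")                             # drop color (colorization)
    degraded = degraded.filter(ImageFilter.GaussianBlur(2))      # blur (deblurring)
    w, h = original.size
    degraded = degraded.resize((w // 4, h // 4)).resize((w, h))  # lose detail (upscaling)
    degraded.save(out_dir / f"input_{src.name}")
    original.save(out_dir / f"target_{src.name}")
```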
The third and highest tier consists of problems which require manual construction of the dataset. Speech recognition, adult content detection, search engine ranking: all of these require data to be labeled manually. If a problem falls into this tier, it will take 2-10 times longer to solve than a "similar" problem from the other tiers. For example, the British National Corpus, which is widely used for training language models, was constructed over three years with input from multiple organizations; it is doubtful that a small company could commit such resources. Not only is manual labeling time-consuming, it also requires extra effort from the ML engineer, who has to assemble the training dataset piece by piece knowing that every data batch scheduled for labeling costs money. Every new batch must increase the overall representativeness of the dataset, or money will be wasted. Selecting such batches is significantly harder than filtering dirty data (tier 1) or creating additional data mutators (tier 2).
When approached with a new ML project, the first question to ask is: how hard will it be to create the dataset? People without a technical background often forget about data. In fact, more effort is spent on dataset construction than on fitting the ML model. Projects which have to rely on manual data labeling are particularly prone to incorrect expectations. Think twice before committing to such projects.
1.5.5. 👨🏫 No classification error allowed
When building a classifier, only one type of error can be minimized.
This is an example of typical conversation between a manager and a tech lead:
Manager: Hello, I have an idea that will earn us a fortune.
TechLead: What is it?
Manager: There are a lot of prospective customers who are willing to pay for a classifier that is able to mark images with explicit content. Forums, social networks, file hostings — actually all sorts of public web sites where people can upload photos. We will be the first to implement this classifier and will become rich.
TechLead: Nope, we won't be able to create such a classifier.
Manager: What are you talking about? I do not see any offensive images in Google image search. Surely they can do this, and so can we.
At this moment the tech lead has to explain the concepts of precision and recall to the manager. A classifier can be optimized for only one type of error. Either it will erroneously filter out some good images in addition to the bad ones, or it will allow some bad images to pass through. But it is impossible to build a classifier that is free of both errors. In other words, there is a knob in the ML model that allows choosing which error matters more. Depending on how you treat the "gray" zone, the error shifts to one side or the other: the lower one error type, the higher the other. If the knob is set to the middle position, nothing good happens, because both error types will be greater than zero.
In the hypothetical classifier discussed by the manager and the tech lead, neither error is acceptable. If the classifier allows bad images through, users will be afraid to visit the web site in public or to let children use it. And if it blocks the upload of innocent images, users will be frustrated, not understanding what they did wrong. Google, on the other hand, is not so constrained. It can turn the ML knob to a position where all bad images are filtered out along with some good ones. The web is huge: it will certainly be able to find other images that match the query and also pass the classifier. If the classifier even slightly complains about some image, that image is removed and replaced with another, absolutely innocent one.
Similarly, you may have noticed that a lot of signboards in Google Street View are blurred as if they were car license plates. The reason is the same: the ML knob is in a position where the model correctly detects all true license plates at the cost of misclassifying some ordinary signboards as plates.
When dealing with spam, the situation is the opposite. It is better to allow some spam messages through the antispam filter than to throw away important messages. Only when the classifier is 99.999% sure that a message is spam is it filtered out.
Another example of this kind is an ads engine. Suppose that a classifier is able to use location information to predict that a user's car has just broken down on the road. A positive classification instructs the ads engine to promote nearby service centers and dealerships to that user. Misclassifying a user and displaying irrelevant ads because they stopped on the road for some random reason is not a big problem, but it would be inexcusable to miss a click from a person who really needs such services. Even if the classifier is so bad that four out of five people are marked incorrectly, the ads should be displayed.
It should be noted that if neither error type is acceptable, it is not the end of the world yet. It just means that the classifier cannot work in fully automatic mode, but only as a prefilter; the final decision must be made by a real person. In the original example about explicit content detection, the classifier may be tuned to make one of the following decisions: a) publish the photo immediately if it is sure the photo is OK; b) immediately reject the photo and notify the user if it is sure the photo is not OK; c) add the photo to a queue for manual review in case of uncertainty. The trouble is that it may be prohibitively expensive to pay reviewers. However, if you manage to get people to work for free (usually such people are proudly called "moderators"), then it can work out.
Conclusion: when building classifiers, an important question to ask is: assuming the cumulative error will be 20%, which way would you prefer to turn the knob?
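A minimal sketch of that knob, assuming the classifier outputs a probability of "bad" content (the threshold values are illustrative):

```python
import numpy as np

def flag_as_bad(probabilities: np.ndarray, threshold: float) -> np.ndarray:
    """Flag an image only if the model is at least `threshold` confident.

    Lowering the threshold blocks more bad images but also more innocent ones
    (search-engine setting); raising it lets more bad images through but rarely
    bothers legitimate users (spam-filter setting).
    """
    return probabilities >= threshold

probs = np.array([0.05, 0.40, 0.55, 0.97])
print(flag_as_bad(probs, threshold=0.50))   # middle position: both error types occur
print(flag_as_bad(probs, threshold=0.99))   # only near-certain cases are blocked
```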
1.5.6. 👨🏫 Time limits restrict model quality
Anyone even slightly acquainted with machine learning knows that training is a slow process, particularly in the case of neural networks. But not everybody realizes that training time is only a small nuisance. The real problem is not the training time; it is the running time of the already trained model.
For example, imagine a neural network that performs the currently popular face morphing. If it is applied to a still photo, there are no hard time limits: users will patiently wait a couple of seconds while their face is being morphed, provided that in return they get a nice visual effect. However, if the same model is to be applied to a real-time video stream (e.g., you want to morph your face during an online video chat), the time constraints create a challenge. Smartphones produce video streams of 30 frames per second, meaning the model must process every single frame in under 33 milliseconds. This limit is too strict for a model performing high-quality morphing. As such, either the model must be simplified, or the video resolution must be lowered, or the frame rate must be reduced. Any such simplification reduces output quality.
Issues with performance are not unique to neural networks; old-school machine learning algorithms may also be affected. For example, web search engines traditionally use ensembles of decision trees to find the top N relevant documents to display. These models are computationally much cheaper than neural networks, but the trouble is that to build a single result page, the model must be applied to thousands of candidate documents; only the top N highest-ranked candidates are displayed. To satisfy the time limit, developers have to use various heuristics, the most drastic one being to drop a large portion of documents from ranking altogether. This is one of the reasons why search engines may display irrelevant results: relevant documents are actually present in the search engine database, but they were skipped by the ranking engine because of the time limit.
Another commonly forgotten aspect is the time required to preprocess "raw" features into "model" features that can be fed to a model. Raw features (e.g., the pixels of the original 3264x2448 photo; or age, sex and salary history in a credit scoring computation) are rarely fed directly into the model. They are typically preprocessed into higher-level features. Different ML algorithms impose different restrictions and thus require different preprocessing steps to make the model work correctly. In the case of photo analysis, preprocessing may include converting the photo to black-and-white and downscaling it to fixed dimensions of 128x128.
These 16384 pixels are then used directly as the input features of the neural network. In the case of credit scoring, preprocessing may consist of generating feature combinations (Age × max(Age, 30) × log(CurrentSalary)), computing statistics ("median monthly credit payment over the last 15 years") and more computationally intensive steps like Principal Component Analysis. This preprocessing takes time, which may even exceed the time it takes to apply the model. Feature preprocessing is generally more expensive in old-school ML models than in neural networks; the latter impose fewer restrictions, which is one of the reasons why NNs have become popular recently. In all cases the preprocessing step does exist and takes time. The bad news is that this time is not included in the reports generated by model training frameworks. Training frameworks work directly with "model" features, which are typically prepared beforehand and supplied in a CSV file. When training is over, frameworks produce a report that includes, among other things, the time it takes to apply the model to a single sample: "test collection: 1200 samples; score 96.4%; 4 ms/sample". This timing may be misleading since it does not include preprocessing time, which is out of the scope of model training. It is therefore strongly recommended to run a separate test that measures the timing of the end-to-end workflow. Summarizing the above, the overall time to compute a single "logical" result is determined by the factors listed below; every component may dominate the others.
The following factors contribute to the running time:
- complexity of preprocessing the raw features;
- number of features fed into the model;
- complexity of the model itself (number of layers and connections in a NN; number of trees and their depth in decision tree models, etc.);
- number of samples that must be fed into the model to produce a single "logical" result;
- hardware the model will run on.
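A minimal sketch of the end-to-end timing decomposition these factors imply (a reconstruction for illustration, not a formula quoted from any framework report):

```python
def time_per_logical_result(n_samples: int,
                            t_preprocess_per_sample: float,
                            t_apply_model_per_sample: float) -> float:
    """Rough end-to-end time (seconds) to produce one 'logical' result.

    n_samples: how many samples must be scored for one result
               (e.g., thousands of candidate documents per search page).
    t_preprocess_per_sample: raw-feature preprocessing time per sample.
    t_apply_model_per_sample: model application time per sample, which
                              depends on model complexity and hardware.
    Any of these terms can dominate the others.
    """
    return n_samples * (t_preprocess_per_sample + t_apply_model_per_sample)
```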
There is a huge difference between running a model on a GTX 1080 video card, a general-purpose Intel/AMD CPU, and a smartphone ARM CPU. If you have tight time limits, then a lot of things have to be simplified, thereby reducing model quality. Be prepared for that.
2. Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.
Machine learning is not about solving some random problem that looks commercially appealing. It is all about finding a problem for which a good training dataset can be acquired.
2.1. Collect data
Acquire within the project the data (or access to the data) listed in the project resources.
2.1.1. Data collection report
List the dataset (or datasets) acquired, together with their locations within the project, the methods used to acquire them and any problems encountered. Record problems encountered and any solutions achieved to aid with future replication of this project or with the execution of similar future projects.
2.1.2. Data Acquisition
Automate data acquisition and any processes that were necessary to ingest the data.
2.2. Describe data
Examine the “gross” or “surface” properties of the acquired data and report on the results.
2.2.1. Data description report
Describe the data which has been acquired, including: the format of the data, the quantity of data, for example number of records and fields in each table, the identities of the fields and any other surface features of the data which have been discovered. Does the data acquired satisfy the relevant requirements?
2.3. Data exploration
This task tackles the data mining questions, which can be addressed using querying, visualization and reporting. These include: distribution of key attributes, for example the target attribute of a prediction task; relations between pairs or small numbers of attributes; results of simple aggregations; properties of significant sub-populations; simple statistical analyses. These analyses may address directly the data mining goals; they may also contribute to or refine the data description and quality reports and feed into the transformation and other data preparation needed for further analysis.
Describe results of this task including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate, include graphs and plots, which indicate data characteristics or lead to interesting data subsets for further examination.
- Report tables and their relations.
- Check data volume, number of multiples, complexity.
- Check attribute types (numeric, symbolic, taxonomy etc.).
- Check attribute value ranges.
- Analyze attribute correlations.
- Understand the meaning of each attribute and attribute value in business terms.
- For each attribute, compute basic statistics (e.g., distribution, average, max, min, standard deviation, variance, mode, skewness, etc.); a minimal sketch follows this list.
- Analyze basic statistics and relate the results to their meaning in business terms.
- Is the attribute relevant for the specific data mining goal?
- Is the attribute meaning used consistently?
- Interview a domain expert for their opinion on attribute relevance.
- Is it necessary to balance the data (Depending on the modeling technique used)?
- Analyze key relations.
- Check amount of overlaps of key attribute values across tables.
- Review assumptions/goals.
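A minimal sketch of computing those per-attribute statistics with pandas (the file name and columns are placeholders):

```python
import pandas as pd

df = pd.read_csv("customers.csv")        # placeholder data source

# Count, mean, std, min, max and quartiles for every attribute.
print(df.describe(include="all"))

# Higher moments and simple pairwise relations for numeric attributes.
print(df.skew(numeric_only=True))
print(df.corr(numeric_only=True))
```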
2.3.1. Verify data quality
Examine the quality of the data, addressing questions such as: is the data complete (does it cover all the cases required)? Is it correct or does it contain errors and if there are errors how common are they? Are there missing values in the data? If so how are they represented, where do they occur and how common are they?
List the results of the data quality verification; if quality problems exist, list possible solutions. Solutions to data quality problems generally depend heavily on both data and business knowledge.
"There are many dimensions of data quality. [...] For me, the most important ones are completeness, consistency, and correctness." Completeness refers to the sparsity of data within each characteristic (i.e., does the data cover the whole range of possible values?). Consistency refers to the format and representation of the data, which should be the same across the dataset. Correctness refers to the degree to which you can rely on the data actually being true. Correctness is strongly influenced by the way the data was collected.
By analogy, training data needs testing like code, and a trained ML model needs production practices like a binary does, such as debuggability, rollbacks and monitoring. The ISO/IEC standard 25012 [30] describes characteristics of data quality. Interestingly, this standard is not as widely used in RE as its sibling ISO/IEC 25010 [26].
2.3.2. Schema Validation
It is useful to encode intuitions about the data in a schema so they can be automatically checked. For example, an adult human is surely between one and ten feet in height. The most common word in English text is probably 'the', with other word frequencies following a power-law distribution.
To construct the schema, one approach is to start with calculating statistics from training data, and then adjusting them as appropriate based on domain knowledge. It may also be useful to start by writing down expectations and then compare them to the data to avoid an anchoring bias. Visualization tools such as Facets can be very useful for analyzing the data to produce the schema.
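A minimal sketch of such a schema encoded as an executable check (the columns, dtypes and ranges are placeholders; dedicated tools such as TensorFlow Data Validation or Great Expectations serve the same purpose at scale):

```python
import pandas as pd

# Hypothetical schema: expected dtype and plausible value range per column.
SCHEMA = {
    "height_ft": {"dtype": "float64", "min": 1.0, "max": 10.0},
    "age_years": {"dtype": "int64", "min": 0, "max": 130},
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return human-readable schema violations found in the dataframe."""
    problems = []
    for column, rules in SCHEMA.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        if str(df[column].dtype) != rules["dtype"]:
            problems.append(f"{column}: dtype {df[column].dtype}, expected {rules['dtype']}")
        out_of_range = df[(df[column] < rules["min"]) | (df[column] > rules["max"])]
        if len(out_of_range):
            problems.append(f"{column}: {len(out_of_range)} values outside "
                            f"[{rules['min']}, {rules['max']}]")
    return problems
```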
3. Modeling
3.1. Select modeling technique
As the first step in modeling, select the actual modeling technique that is to be used. Although you may already have selected a tool during business understanding, this task refers to the specific modeling technique, e.g., decision tree building with C4.5 or neural network generation with back propagation. If multiple techniques are applied, perform this task separately for each technique.
Modeling technique. Document the actual modeling technique that is to be used.
Modeling assumptions. Many modeling techniques make specific assumptions on the data, e.g., all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic etc. Record any such assumptions made.
3.2. Generate test design
Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the dataset into train and test set, build the model on the train set and estimate its quality on the separate test set.
Test design. Describe the intended plan for training, testing and evaluating the models. A primary component of the plan is to decide how to divide the available dataset into training data, test data and validation datasets.
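A minimal sketch of one common split strategy (the proportions and the stand-in dataset are illustrative; time-based or grouped splits may be more appropriate for some problems):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)   # stand-in data

# 60% train, 20% validation, 20% test, stratified by class.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)
```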
3.3. Assess model
The data mining engineer interprets the models according to their domain knowledge, the data mining success criteria and the desired test design. This task overlaps with the subsequent evaluation phase: whereas here the data mining engineer judges the success of the application of modeling and discovery techniques in technical terms, they contact business analysts and domain experts later in order to discuss the data mining results in the business context. Moreover, this task only considers models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project.
3.4. Using version control
Every model specification undergoes a code review and is checked in to a repository: It can be tempting to avoid code review out of expediency, and run experiments based on one's own personal modifications. In addition, when responding to production incidents, it's crucial to know the exact code that was run to produce a given learned model. For example, a responder might need to re-run training with corrected input data, or compare the result of a particular modification. Proper version control of the model specification can help make training auditable and improve reproducibility.
3.5. Training is reproducible
Ideally, training twice on the same data should produce two identical models. Deterministic training dramatically simplifies reasoning about the whole system and can aid auditability and debugging. For example, optimizing feature generation code is a delicate process but verifying that the old and new feature generation code will train to an identical model can provide more confidence that the refactoring was correct. This sort of diff-testing relies entirely on deterministic training. Unfortunately, model training is often not reproducible in practice, especially when working with non-convex methods such as deep learning or even random forests. This can manifest as a change in aggregate metrics across an entire dataset, or, even if the aggregate performance appears the same from run to run, as changes on individual examples. Random number generation is an obvious source of nondeterminism, which can be alleviated with seeding. But even with proper seeding, initialization order can be underspecified so that different portions of the model will be initialized at different times on different runs leading to non-determinism. Furthermore, even when initialization is fully deterministic, multiple threads of execution on a single machine or across a distributed system may be subject to unpredictable orderings of training data, which is another source of non-determinism.
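A minimal sketch of pinning the obvious sources of randomness (framework-specific settings, such as deterministic GPU kernels or data-loading order, are not shown and may still be required):

```python
import os
import random

import numpy as np

SEED = 42

os.environ["PYTHONHASHSEED"] = str(SEED)   # hash-based orderings
random.seed(SEED)                          # Python's built-in RNG
np.random.seed(SEED)                       # NumPy's global RNG
# If a deep learning framework is used, seed it too, e.g.
# torch.manual_seed(SEED) or tf.random.set_seed(SEED).
```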
3.6. Monitor for numeric stability
The model is numerically stable: Invalid or implausible numeric values can potentially crop up during model training without triggering explicit errors, and knowing that they have occurred can speed diagnosis of the problem.
How? Explicitly monitor the initial occurrence of any NaNs or infinities. Set plausible bounds for weights and the fraction of ReLU units in a layer returning zero values, and trigger alerts during training if these exceed appropriate thresholds.
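A minimal sketch of such a monitor, assuming the weights can be inspected as NumPy arrays (the bound is a placeholder):

```python
import numpy as np

def numeric_stability_alerts(weights: dict[str, np.ndarray],
                             max_abs_weight: float = 1e3) -> list[str]:
    """Report NaNs, infinities, or implausibly large values in model weights."""
    alerts = []
    for name, w in weights.items():
        if np.isnan(w).any():
            alerts.append(f"{name}: contains NaN")
        if np.isinf(w).any():
            alerts.append(f"{name}: contains Inf")
        if np.abs(w).max() > max_abs_weight:
            alerts.append(f"{name}: |weight| exceeds {max_abs_weight}")
    return alerts
```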
3.7. Prepare micro training set for testing
Apart from having a training / test dataset split, it is extremely useful to have a tiny set of data that allows quicker model iteration and testing.
3.8. Pipeline allows debugging
The model should allow debugging by observing the step-by-step computation of training or inference on a single example: When someone finds a case where a model is behaving bizarrely, how difficult is it to figure out why? Is there an easy, well-documented process for feeding a single example to the model and investigating the computation through each stage of the model? This is especially important when the model is deployed in production and a user reports a bug. If the model is not easily debuggable, it can be difficult to figure out what went wrong, and even harder to fix it.
3.9. Test model specification code
Model specification code is unit tested: Although model specifications may seem like “configuration”, such files can have bugs and need to be tested. Unfortunately, testing a model specification can be very hard. Unit tests should run quickly and require no external dependencies but model training is often a very slow process that involves pulling in lots of data from many sources.
How? It is useful to distinguish two kinds of model tests: tests of API usage and tests of algorithmic correctness.
ML APIs can be complex, and code using them can be wrong in subtle ways. Even if code errors would be apparent after training (due to a model that fails to train or results in poor performance), training is expensive and so the development loop is slow. We have found in practice that a simple unit test to generate random input data, and train the model for a single step of gradient descent is quite powerful for detecting a host of common library mistakes, resulting in a much faster development cycle. Another useful assertion is that a model can restore from a checkpoint after a mid-training job crash.
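A minimal sketch of such a test, using scikit-learn's partial_fit as a stand-in for a single optimization step (the model, shapes and labels are placeholders):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def test_model_trains_one_step_on_random_data():
    """Catch API misuse cheaply: random inputs, one optimization step."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 10))               # placeholder feature shape
    y = rng.integers(0, 2, size=32)             # placeholder binary labels
    model = SGDClassifier()
    model.partial_fit(X, y, classes=np.array([0, 1]))   # one step, no full training
    assert model.predict(X).shape == (32,)
```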
Testing the correctness of a novel implementation of an ML algorithm is more difficult, but still necessary: it is not sufficient that the code produces a model with high-quality predictions; it must do so for the expected reasons. One solution is to assert that specific subcomputations of the algorithm are correct, e.g., that a specific part of an RNN was executed exactly once per element of the input sequence. Another is not to train to completion in the unit test but only for a few iterations, verifying that the loss decreases with training. Still another is to purposefully train a model to overfit: if one can get a model to effectively memorize its training data, that provides some confidence that learning reliably happens. When testing models, take pains to avoid "golden tests", i.e., tests that partially train a model and compare the results to a previously generated model; such tests are difficult to maintain over time without blindly updating the golden file. In addition to problems with training non-determinism, when these tests do break they provide very little insight into how or why. Additionally, flaky tests remain a real danger here.
3.10. Test model against a baseline
A simpler model is not better: Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.
3.11. Test features for predictive power
All features are beneficial: A kitchen-sink approach to features can be tempting, but every feature added has a software engineering cost. Hence, it's important to understand the value each feature provides in additional predictive power (independent of other features). This is particularly useful for explainability – models with fewer features are easier to understand for humans.
How? Some ways to run this test are by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
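A minimal sketch of the "remove one feature at a time" variant (the estimator and scoring are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def feature_value(X: np.ndarray, y: np.ndarray) -> dict[int, float]:
    """Score drop when each feature is removed; a near-zero drop suggests low value."""
    baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    drops = {}
    for i in range(X.shape[1]):
        X_without = np.delete(X, i, axis=1)
        score = cross_val_score(LogisticRegression(max_iter=1000), X_without, y, cv=5).mean()
        drops[i] = baseline - score
    return drops
```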
3.12. Test model on data slices
Model quality is sufficient on all important data slices: Slicing a data set along certain dimensions of interest can improve fine-grained understanding of model quality. Slices should distinguish subsets of the data that might behave qualitatively differently, for example, users by country, users by frequency of use, or movies by genre.
Examining sliced data avoids having fine-grained quality issues masked by a global summary metric, e.g., global accuracy improved by 1% but accuracy for one country dropped by 50%. This class of problems often arises from a fault in the collection of training data that caused an important set of training data to be lost or late.
How? Consider including these tests in your release process. For example, release tests for models can impose absolute thresholds (e.g., error for slice x must be less than 5%) to catch large drops in quality, as well as incremental thresholds (e.g., the change in error for slice x must be less than 1% compared to the previously released model).
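A minimal sketch of a per-slice release check (the slicing column, metric and threshold are placeholders):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def slice_report(df: pd.DataFrame, slice_col: str = "country",
                 min_accuracy: float = 0.95) -> pd.DataFrame:
    """Per-slice accuracy with a pass/fail flag against an absolute threshold.

    Assumes the dataframe has 'label' and 'prediction' columns (placeholder names).
    """
    rows = []
    for value, group in df.groupby(slice_col):
        acc = accuracy_score(group["label"], group["prediction"])
        rows.append({"slice": value, "n": len(group),
                     "accuracy": acc, "passes": acc >= min_accuracy})
    return pd.DataFrame(rows)
```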
3.13. Enforce privacy and security constraints
Features adhere to meta-level requirements: Your project may impose requirements on the data coming in to the system. It might prohibit features derived from user data, prohibit the use of specific features like age, or simply prohibit any feature that is deprecated. It might require all features be available from a single source. However, during model development and experimentation, it is typical to try out a wide variety of potential features to improve prediction quality.
How? Programmatically enforce these requirements, so that all models in production properly adhere to them.
The data pipeline has appropriate privacy controls: Training data, validation data, and vocabulary files all have the potential to contain sensitive user data. While teams often are aware of the need to remove personally identifiable information (PII), during this type of exporting and transformations, programming errors and system changes can lead to inadvertent PII leakages that may have serious consequences.
How? Make sure to budget sufficient time during new feature development that depends on sensitive data to allow for proper handling. Test that access to pipeline data is controlled as tightly as the access to raw user data, especially for data sources that haven't previously been used in ML. Finally, test that any user-requested data deletion propagates to the data in the ML training pipeline, and to any learned models.
3.14. Test feature code
All input feature code is tested: Feature creation code may appear simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital. Bugs in features may be almost impossible to detect once they have entered the data generation process, especially if they are represented in both training and test data.
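A minimal sketch of unit-testing feature code (the feature itself is a hypothetical example):

```python
def days_since_last_purchase(last_purchase_ts: int, now_ts: int) -> int:
    """Hypothetical feature: whole days between the last purchase and 'now'."""
    return max(0, (now_ts - last_purchase_ts) // 86_400)

def test_days_since_last_purchase():
    assert days_since_last_purchase(0, 86_400) == 1
    assert days_since_last_purchase(0, 0) == 0
    # Clock skew must never produce negative feature values.
    assert days_since_last_purchase(86_400, 0) == 0
```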
3.15. Test for training-serving skew
Training-serving skew is much more related to the preprocessing step than to the model or the modeling itself. Training data is typically sourced from batch files, from databases directly, or from some kind of storage that releases the data in batches; the first thing after sourcing is to preprocess this data. Now think of where prediction data comes from. Models have to make a prediction on almost every incoming request, which is a totally different situation from training: prediction data is sourced from streaming pipelines. But that is not the problem. The problem is how we process this data. More often than not, preprocessing of prediction data from streaming pipelines is done in an ad-hoc manner with many shortcuts.
Needless to say, the best way to solve this is to ensure that batch and streaming data are processed in the same manner using the same pipeline. Cleaning, transformation and all related tasks should be the same for training and serving data. In fact, this is so crucial that there are dedicated architectures to ensure it: the Lambda architecture and the Kappa architecture. Consider both of these architectures before deploying and select the one that suits your needs.
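A minimal sketch of the underlying idea: a single preprocessing function imported by both the batch training job and the streaming prediction service (the feature logic and field names are placeholders):

```python
# features.py -- the single source of truth for preprocessing.

def preprocess(record: dict) -> list[float]:
    """Turn one raw record into model features; used for training AND serving."""
    return [
        float(record.get("age", 0)) / 100.0,
        1.0 if record.get("country") == "DE" else 0.0,
    ]

# Batch training job:        X = [preprocess(r) for r in training_records]
# Streaming prediction path: features = preprocess(incoming_request)
```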
3.16. Tune all hyperparameters
All hyperparameters have been tuned: An ML model can often have multiple hyperparameters, such as learning rates, number of layers, layer sizes and regularization coefficients. The choice of hyperparameter values can have a dramatic impact on prediction quality.
How? Methods such as a grid search or a more sophisticated hyperparameter search strategy not only improve prediction quality, but also can uncover hidden reliability issues. Substantial performance improvements have been realized in many ML systems through use of an internal hyperparameter tuning service.
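For example, a basic grid search with scikit-learn looks like the sketch below; the estimator, parameter grid, and synthetic data are illustrative assumptions, and a tuning service or Bayesian search would follow the same pattern of search space, objective, and best trial:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],   # regularization strength
    "penalty": ["l2"],
}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="neg_log_loss")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```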
3.17. Test for fairness
The model has been tested for considerations of inclusion: There have been a number of recent studies on the issue of ML fairness [14], [15], which may arise inadvertently due to factors such as choice of training data. For example, Bolukbasi et al. found that a word embedding trained on news articles had learned some striking associations between gender and occupation that may have reflected the content of the news articles but which may have been inappropriate for use in a predictive modeling context. Such potentially overlooked biases in training data sets may then influence larger system behavior. How? Diagnosing such issues is an important step in creating robust modeling systems that serve all users well. Tests that can be run include examining input features to determine whether they correlate strongly with protected user categories, and slicing predictions to determine whether prediction outputs differ materially when conditioned on different user groups.
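The two tests mentioned above could start as simply as the following sketch, assuming a pandas DataFrame of logged predictions with hypothetical column names ("group", "prediction") and a candidate feature to probe:

```python
import pandas as pd

def prediction_rate_by_group(df: pd.DataFrame) -> pd.Series:
    """Slice positive-prediction rates by a protected attribute."""
    return df.groupby("group")["prediction"].mean()

def feature_group_correlation(df: pd.DataFrame, feature: str) -> float:
    """Crude proxy check: how strongly does a feature track group membership?"""
    return df[feature].corr(df["group"].astype("category").cat.codes)

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "prediction": [1, 1, 0, 1],
    "zipcode_income": [80, 85, 30, 35],
})
print(prediction_rate_by_group(df))                       # a: 1.0 vs. b: 0.5 warrants a closer look
print(feature_group_correlation(df, "zipcode_income"))    # near -1.0: strong proxy for group
```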
3.18. Plan for data drift and model staleness
The impact of model staleness is known: Many production ML systems encounter rapidly changing, non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, we say the model is stale. Understanding how model staleness affects the quality of predictions is necessary to determine how frequently to update the model. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? Most models need to be updated eventually to account for changes in the external world; a careful assessment is important to decide how often to perform the updates.
How? One way of testing the impact of staleness is with a small A/B experiment with older models. Testing a range of ages can provide an age-versus-quality curve to help understand what amount of staleness is tolerable.
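A rough sketch of producing that curve, assuming archived model snapshots and a labeled sample of today's traffic; the snapshot list and the AUC metric are placeholders:

```python
from sklearn.metrics import roc_auc_score

def staleness_curve(model_snapshots, X_today, y_today):
    """model_snapshots: list of (age_in_days, fitted_model) pairs."""
    curve = []
    for age_days, model in model_snapshots:
        scores = model.predict_proba(X_today)[:, 1]
        curve.append((age_days, roc_auc_score(y_today, scores)))
    return sorted(curve)

# e.g. staleness_curve([(1, m_yesterday), (7, m_last_week), (365, m_last_year)], X_today, y_today)
# A steep drop between 7 and 365 days of age shows how much staleness is tolerable,
# and hence how often the model needs to be retrained and redeployed.
```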
4. Initial Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Operationalizing model evaluation requires an active organizational effort. Popular model evaluation “best practices” do not do justice to the rigor with which organizations think about deployments: they generally focus on a single, typically static held-out dataset to evaluate the model on [38] and a single choice of ML metric (e.g., precision, recall) [1, 2]. We find that MLEs invest significant resources in maintaining multiple up-to-date evaluation datasets and metrics over time, especially ensuring that data sub-populations of interest are adequately covered.
Validation. Since errors become more expensive to handle when users see them, it's good to test changes, prune bad ideas, and proactively monitor pipelines for bugs as early as possible (P1, P2, P5, P6, P7, P10, P14, P15, P18). P1 said: “The general theme, as we moved up in maturity, is: how do you do more of the validation earlier, so the iteration cycle is faster?”
4.1. Monitoring
Due to the dependency between the behavior of an ML system and the data it has been trained on, it is crucial to define actions that ensure that training data actually corresponds to real data. Since data characteristics in reality may change over time, requirements validation becomes an activity that needs to be performed continuously during system operation. Our interviewees agreed that monitoring and analysis of runtime data is essential for maintaining the performance of the ML system. They also agreed that ML systems need to be retrained regularly to adjust to recent data. By analyzing the problem domain, a requirements engineer should specify when and how often retraining is necessary. A requirements engineer should also specify conditions for data anomalies that may potentially lead to unreasonable behavior of the ML system during runtime.
4.2. Evaluate results
Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and seeks to determine whether there is some business reason why the model is deficient. Another option is to test the model(s) in trial applications within the real environment, if time and budget constraints permit.
Moreover, evaluation also assesses the other data mining results generated. Data mining results cover both models that are directly related to the original business objectives and other findings that are not necessarily related to those objectives but may unveil additional challenges, information, or hints for future directions.
Assess the data mining results with respect to the business success criteria. Summarize the assessment in terms of those criteria, including a final statement on whether the project already meets the initial business objectives.
5. Deployment
Creation of the model is generally not the end of the project. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. In many cases it is the customer, not the data analyst, who carries out the deployment steps.
5.1. Deployment plan
In order to deploy the data mining result(s) into the business, this task takes the evaluation results and derives a deployment strategy. If a general procedure has been identified for creating the relevant model(s), this procedure is documented here for later deployment. Summarize the deployment strategy, including the necessary steps and how to perform them.
5.2. Verify model quality prior to deployment
Model quality must be validated before attempting to serve it: After a model is trained but before it actually affects real traffic, an automated system needs to inspect it and verify that its quality is sufficient; that system must either bless the model or veto it, blocking its entry into the production environment.
How? It is important to test for both slow degradations in quality over many versions as well as sudden drops in a new version. For the former, setting loose thresholds and comparing against predictions on a validation set can be useful; for the latter, it is useful to compare predictions to the previous version of the model while setting tighter thresholds.
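A minimal sketch of such a bless-or-veto gate, assuming a single validation metric (AUC here) for the candidate and for the previously released model; the thresholds are illustrative assumptions:

```python
SLOW_DEGRADATION_FLOOR = 0.75   # loose absolute floor, catches erosion over many versions
SUDDEN_DROP_LIMIT = 0.01        # tight allowed drop versus the previous model

def bless_model(candidate_auc: float, previous_auc: float) -> bool:
    if candidate_auc < SLOW_DEGRADATION_FLOOR:
        return False                              # veto: quality eroded below the floor
    if previous_auc - candidate_auc > SUDDEN_DROP_LIMIT:
        return False                              # veto: sudden drop versus the last release
    return True                                   # bless: allow entry to production

print(bless_model(candidate_auc=0.81, previous_auc=0.82))   # True
print(bless_model(candidate_auc=0.78, previous_auc=0.82))   # False (sudden drop)
```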
5.3. Integration test the entire pipeline
The full ML pipeline is integration tested: A complete ML pipeline typically consists of assembling training data, feature generation, model training, model verification, and deployment to a serving system. Although a single engineering team may be focused on a small part of the process, each stage can introduce errors that may affect subsequent stages, possibly even several stages away. That means there must be a fully automated test that runs regularly and exercises the entire pipeline, validating that data and code can successfully move through each stage and that the resulting model performs well.
How? The integration test should run both continuously and with each new release of models or servers, in order to catch problems well before they reach production. Faster-running integration tests with a subset of training data or a simpler model can give developers quicker feedback, while still being backed by less frequent, long-running versions with a setup that more closely mirrors production.
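As a rough pytest-style illustration, the fast variant can exercise every stage on a tiny synthetic dataset so it runs in seconds; the stages below are stand-ins for your own assemble/featurize/train/verify/deploy code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def assemble_training_data(n=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)
    return X, y

def generate_features(X):
    return np.hstack([X, X[:, :1] ** 2])          # stand-in feature generation stage

def test_pipeline_end_to_end_small():
    X_raw, y = assemble_training_data()
    X = generate_features(X_raw)
    model = LogisticRegression().fit(X, y)        # stand-in training stage
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    assert auc > 0.7, "verification stage: model should clearly beat random"
    assert model.predict(X[:1]).shape == (1,)     # stand-in serving/deployment check
```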
5.4. Canary testing of the models
Models are tested via a canary process before they enter production serving environments: Offline testing, however extensive, cannot by itself guarantee the model will perform well in live production settings, as the real world often contains significant non-stationarity or other issues that limit the utility of historical data. Consequently, there is always some risk when turning on a new model in production.
One recurring problem that canarying can help catch is a mismatch between model artifacts and serving infrastructure. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. For example, a refactoring in the core learning library might change the low-level implementation of an operation Op in the model from Op0.1 to a more efficient implementation, Op0.2. A newly trained model will thus expect to be implemented with Op0.2; an older deployed server will not include Op0.2 and so will refuse to load the model.
How? To mitigate the mismatch issue, one approach is testing that a model successfully loads into production serving binaries and that inference on production input data succeeds. To mitigate the new-model risk more generally, one can turn up new models gradually, running old and new models concurrently, with new models only seeing a small fraction of traffic, gradually increased as the new model is observed to behave sanely.
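A sketch of that gradual turn-up, assuming a traffic router that can set the fraction of requests served by the new model and a health check over live metrics (both interfaces are hypothetical):

```python
import time

RAMP_STEPS = [0.01, 0.05, 0.20, 0.50, 1.00]

def canary_rollout(router, new_model_id, healthy, soak_seconds=3600):
    # First verify that the production serving binary can load the model and run inference.
    if not router.smoke_test(new_model_id):
        return "aborted: model failed to load or serve in the production binary"
    for fraction in RAMP_STEPS:
        router.set_traffic_fraction(new_model_id, fraction)
        time.sleep(soak_seconds)                              # let live metrics accumulate
        if not healthy(new_model_id):
            router.set_traffic_fraction(new_model_id, 0.0)    # pull the canary back
            return f"aborted at {fraction:.0%}: live metrics degraded"
    return "fully rolled out"
```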
5.5. Have a rollback system
Models can be quickly and safely rolled back to a previous serving version: A model “rollback” procedure is a key part of incident response to many of the issues that can be detected by the monitoring discussed in Section 5.6. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system. Because rolling back is an emergency procedure, operators should practice it regularly, under normal, non-emergency conditions.
5.6. Plan monitoring and maintenance
Monitoring and maintenance are important issues if the data mining result becomes part of the day-to-day business and its environment. A careful preparation of a maintenance strategy helps to avoid unnecessarily long periods of incorrect usage of data mining results. In order to monitor the deployment of the data mining result(s), the project needs a detailed plan on the monitoring process. This plan takes into account the specific type of deployment.
It is crucial to know not just that your ML system worked correctly at launch, but that it continues to work correctly over time. An ML system is by definition making predictions on previously unseen data, and it typically also incorporates new data into training over time. The standard approach is to monitor the system, i.e. to maintain a constantly updated “dashboard” displaying relevant graphs and statistics, and to automatically alert the engineering team when particular metrics deviate significantly from expectations. For ML systems, it is important to monitor serving systems, training pipelines, and input data. The subsections below recommend specific metrics to monitor throughout the system. The usual incident response approaches apply; one response unique to ML is to roll back not the system code but the learned model, hence the rollback test above (Section 5.5) to regularly ensure that this process is safe and easy.
5.6.1. ML pipeline monitoring and response
Monitoring ML pipelines and responding to bugs involves tracking live metrics (via queries or dashboards), slicing and dicing sub-populations to investigate prediction quality, patching the model with non-ML heuristics for known failure modes, and finding in-the-wild failures and adding them to the evaluation set.
5.6.2. Collect and validate against real KPIs
Offline proxy metrics correlate with actual online impact metrics: A user-facing production system's impact is judged by metrics of engagement, user happiness, revenue, and so forth. A machine learning system is trained to optimize loss metrics such as log-loss or squared error. A strong understanding of the relationship between these offline proxy metrics and the actual impact metrics is needed to ensure that a better scoring model will result in a better production system.
How? The offline/online metric relationship can be measured in one or more small scale A/B experiments using an intentionally degraded model.
5.6.3. Monitor for performance regressions
The model has not experienced a dramatic or slow-leak regression in training speed, serving latency, throughput, or RAM usage: The computational performance (as opposed to predictive quality) of an ML system is often a key concern at scale. Deep neural networks can be slow to train and run inference on; wide linear models with feature crosses can use a lot of memory; any ML model may take days to train; and so forth. Swiftly reacting to changes in this performance due to changes in data, features, modeling, or the underlying compute library or infrastructure is crucial to maintaining a performant system.
How? While measuring computational performance is a standard part of any monitoring, it is useful to slice performance metrics not just by the versions and components of code, but also by data and model versions. Degradations in computational performance may occur as dramatic changes (for which comparison to the performance of prior versions or time slices can be helpful for detection) or as slow leaks (for which a pre-set alerting threshold can be helpful for detection).
5.6.4. Monitor for quality regressions
• Online measurement of accuracy: Just as you need to know the latency of your website and public application programming interfaces, you need to know how accurate your models are in production. How many predictions actually came true? This requires collecting and logging real-use results but is an elementary requirement.
The model has not experienced a regression in prediction quality on served data: Validation data will always be older than real serving input data, so measuring a model’s quality on that validation data before pushing it to serving is only an estimate of quality metrics on actual live serving inputs. However, it is not always possible to know the correct labels even shortly after serving time, making quality measurement difficult.
How? Here are some options for making sure that there is no degradation in served prediction quality due to changes in data, differing code paths, etc.:
• Measure statistical bias in predictions, i.e. the average of predictions in a particular slice of data. Generally speaking, models should have zero bias, in aggregate and on slices (e.g. 90% of the predictions made with probability 0.9 should in fact be positive). Knowing that a model is unbiased is not enough to know it is any good, but knowing there is bias can be a useful canary for detecting problems.
• In some tasks, the label actually is available immediately or soon after the prediction is made (e.g. will a user click on an ad). In this case, we can judge the quality of predictions in almost real time and identify problems quickly.
• Finally, it can be useful to periodically add new training data by having human raters manually annotate labels for logged serving inputs. Some of this data can be held out to validate the served predictions.
However the measurement is done, thresholds must be set for acceptable quality (e.g. based on the quality bounds at launch of the initial system), and a responder should be notified immediately if quality drifts outside those thresholds. As with computational performance, it is crucial to monitor both dramatic and slow-leak regressions in prediction quality.
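The bias check in particular is easy to automate; the sketch below assumes a pandas DataFrame of logged predictions with eventual labels, and the column names and alert threshold are illustrative:

```python
import pandas as pd

def prediction_bias(df: pd.DataFrame, slice_col: str) -> pd.DataFrame:
    """Average predicted probability minus observed positive rate, per slice."""
    report = df.groupby(slice_col).agg(mean_prediction=("prediction", "mean"),
                                       positive_rate=("label", "mean"))
    report["bias"] = report["mean_prediction"] - report["positive_rate"]
    return report

def check_bias_and_alert(df, slice_col, threshold=0.05, notify=print):
    report = prediction_bias(df, slice_col)
    for slice_value, row in report.iterrows():
        if abs(row["bias"]) > threshold:
            notify(f"bias alert: {slice_col}={slice_value} is off by {row['bias']:+.3f}")
    return report
```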
5.6.5. Monitor for data drift
• Mind the gap: That is, watch out for gaps between the distributions of your training and online data sets. This is a simple-to-measure, effective-in-practice heuristic that uncovers a variety of issues. If your training data has 50% high-risk patients, but in production, you're predicting only 30% as high-risk, it's probably time to retrain.
• Online data quality alerts: If the volume or composition of the input data changes in an unexpected way, an alert should go to your operations team. Are your patients suddenly older, more often female, or less often diabetic? If you haven't trained your model on those types of patients, you may be serving bad predictions (see the sketch after this list).
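A minimal sketch of both checks, comparing the high-risk rate and basic input statistics between the training set and a recent serving window; the column names and thresholds are illustrative assumptions:

```python
import pandas as pd

def drift_report(train: pd.DataFrame, recent: pd.DataFrame,
                 rate_col="high_risk", numeric_cols=("age",), max_rate_gap=0.10):
    alerts = []
    rate_gap = abs(train[rate_col].mean() - recent[rate_col].mean())
    if rate_gap > max_rate_gap:
        alerts.append(f"{rate_col} rate shifted by {rate_gap:.2f}; consider retraining")
    for col in numeric_cols:
        shift = abs(train[col].mean() - recent[col].mean()) / (train[col].std() + 1e-9)
        if shift > 0.5:    # more than half a standard deviation of movement
            alerts.append(f"{col} mean moved {shift:.2f} std devs; alert the operations team")
    return alerts
```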
5.6.6. Monitoring and maintenance plan
Summarize monitoring and maintenance strategy including necessary steps and how to perform them.
References
Why ML Models Rarely Reach Production and What You Can Do About it
Why Machine Learning Models Crash And Burn In Production
Full Stack Data Science
Why your Machine Learning model may not work in production?
These Anti-Patterns are Slowing AI Adoption in Enterprises in 2020
Why AI is Challenging in Healthcare
Top 10 Reasons Why AI Projects Fail
Why AI investments fail to deliver
The One Practice That Is Separating The AI Successes From The Failures
Gartner: 85% of AI implementations will fail by 2022
Gartner Predicts Half of Finance AI Projects Will Be Delayed or Cancelled by 2024
Why AI Projects Fail
Top Reasons Why AI Projects Fail
How Data-Literate Is Your Company?
4 Reasons for Artificial Intelligence (AI) Project Failure
Gartner Says Nearly Half of CIOs Are Planning to Deploy Artificial Intelligence
Five Common AI/ML Project Mistakes
Overcoming the C-Suite’s Distrust of AI
Want your company’s A.I. project to succeed? Don’t hand it to the data scientists, says this CEO
The single most important reason why AI projects fail
The Top 5 Reasons Why Most AI Projects Fail
Why 85% of AI projects fail
Beyond the hype: A guide to understanding and successfully implementing artificial intelligence within your business
Twitter taught Microsoft’s AI chatbot to be a racist asshole in less than a day
Google 'fixed' its racist algorithm by removing gorillas from its image-labeling tech
Google Photos Tags Two African Americans As Gorillas Through Facial Recognition Software
Why AI & Machine Learning Projects Fail?
Why Machine Learning Projects Fail Part 2
Why AI investments fail to deliver
Why Nearly 90% of Machine Learning Projects Fail
4 Reasons Why Your Machine Learning Project Could Fail
Here is the list of top10 reasons why large-scale machine learning projects fail
Common Reasons Why Machine Learning Projects Fail
Why Machine Learning Projects Fail
Top 5 Reasons Why Machine Learning Projects Fail
Why Machine Learning Projects Fail and How to Make Sure They Don't
Our Top Data and Analytics Predicts for 2019
White Paper: What Data Scientists Tell Us About AI Model Training Today
IDC Survey Finds Artificial Intelligence to be a Priority for Organizations But Few Have Implemented an Enterprise-Wide Strategy
Requirements Engineering for Machine Learning: A Review and Reflection
The CRISP-DM user guide
A Catalogue of Concerns for Specifying Machine Learning-Enabled Systems
The ML test score: A rubric for ML production readiness and technical debt reduction
How Data-Literate Is Your Company?
MHRA/FDA Principles of Good Machine Learning Practice