1. Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a more concrete Machine Learning or AI problem definition and a preliminary plan designed to achieve the objectives.
1.1. Business Goals
The first objective is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. Your goal is to uncover important factors that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.
1.1.1. 📖 Background Report
Collate the information that is known about the organization's business situation at the start of the project. These details not only serve to more closely identify the business goals that will be solved, but also serve to identify resources, both human and material, that may be used or needed during the course of the project.
❓ Organization
- Develop organizational charts identifying divisions, departments and project groups. The chart should also identify managers' names and responsibilities.
- Identify key persons in the business and their roles.
- Identify an internal sponsor (financial sponsor and primary user/domain expert).
- Identify the business units which are impacted by the data mining project (e.g., Marketing, Sales, Finance).
- Identify any other key stakeholders (e.g., customers, suppliers, regulatory agencies).
❓ Problem area
- Identify the problem area (e.g., Marketing, Customer Care, Business Development, etc.).
- Describe the problem in general terms.
- Check the current status of the project (e.g., is it already clear within the business unit that we are performing an ML/AI project, or do we need to advertise ML/AI as a key technology in the business?).
- Clarify prerequisites of the project (e.g., what is the motivation of the project? Does the business already use machine learning?).
- If necessary, prepare presentations and present ML/AI to the business.
- Identify target groups for the project result (e.g., Do we expect a written report for top management or do we expect a running system that is used by naive end users?).
- Identify the users' needs and expectations.
❓ Current solution
- Describe any solution currently in use for the problem.
- Describe the advantages and disadvantages of the current solution and the level to which it is accepted by the users.
1.1.2. 📖 Business objectives
Describe the customer's primary objective, from a business perspective, in the data mining project. In addition to the primary business objective, there are typically a large number of related business questions that the customer would like to address. For example, the primary business goal might be to keep current customers by predicting when they are prone to move to a competitor, while secondary business objectives might be to determine whether lower fees affect only one particular segment of customers.
- Informally describe the problem which is supposed to be solved with machine learning.
- Specify all business questions as precisely as possible.
- Specify any other business requirements (e.g., the business does not want to lose any customers).
- Specify expected benefits in business terms.
1.1.3. 📖 Business success criteria
Describe the criteria for a successful or useful outcome to the project from the business point of view. These criteria might be quite specific and objectively measurable, such as reducing customer churn to a certain level, or general and subjective, such as "give useful insights into the relationships." In the latter case, indicate who makes the subjective judgment.
- Specify business success criteria (e.g., improve the response rate of a mailing campaign by 10 percent and the sign-up rate by 20 percent).
- Specify the measurable benefits ML is expected to bring to the organization (e.g., increase revenue by X%, increase the number of units sold by Y%, number of trees saved).
- Specify what the users want to achieve by using ML (e.g., for recommendation systems this could involve helping users find content they will enjoy).
- Specify measures that correlate with future success from the business' perspective. This could include the users' affective states when using the ML-enabled system (e.g., customer sentiment and engagement).
- Identify who assesses the success criteria.
Beware of setting unattainable goals; make them as realistic as possible. Customers often see ML as magic ("ML will solve everything"). Let stakeholders know the limitations and manage their expectations.
Each of the success criteria should relate to at least one of the specified business objectives.
1.2. Assess situation
This task involves more detailed fact-finding about all the resources, constraints, assumptions and other factors that should be considered in determining the project goal and plan. In the previous task, your objective was to get quickly to the crux of the situation. Here, you want to flesh out the details.
1.2.1. 📖 Inventory of resources
List the resources available to the project, including: personnel (business experts, data experts, technical support, data mining personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (data mining tools, other relevant software).
❓ Sources of data and knowledge
- Identify the data sources.
- Identify the type of data sources (e.g., on-line sources, experts, written documentation, etc.).
- Identify the knowledge sources.
- Identify the type of knowledge sources (e.g., on-line sources, experts, written documentation, etc.).
- Check available tools and techniques.
- Describe the relevant background knowledge (informally or formally).
❓ Personnel sources
- Identify system administrator, database administrator and technical support staff for further questions.
- Identify market analysts, data mining experts and statisticians and check their availability.
- Check availability of domain experts for later phases.
Remember that the project may need technical staff at odd times throughout the project, for example during Data Transformation.
1.2.2. 📖 Requirements Report
List all requirements of the project including schedule of completion, comprehensibility and quality of results and security as well as legal issues. As part of this output, make sure that you are allowed to use the data.
- Identify the requirements on scheduling (e.g., when do we need to deliver the first results? When do we need to deliver the final results?).
- Identify the requirements on comprehensibility, accuracy, deployability, maintainability and repeatability of the ML/AI project and the resulting model(s).
- Identify the requirements on security, legal restrictions, privacy, reporting and project schedule.
❓ Infrastructure Requirements
- Specify what data streaming strategy will be used (e.g., real-time data transportation or in batches).
- Specify how the model of the ML-enabled system will be executed and consumed (e.g., client-side, back-end, cloud-based, web service end-point).
- Specify the need for ML-enabled system abilities to continuously learn from new data, extending the existing model's knowledge.
- Specify the integration that the model will have with the rest of the system functionality.
- Specify how the system deals with risks to prevent dangerous failures.
- Analyze the probability of the occurrence of harm and its severity for critical systems that incorporate ML.
- Specify how the system deals with security issues (e.g., vulnerabilities) to protect the data. (ML systems often contain sensitive data that should be protected.)
- Specify where the ML artifacts (e.g., models, data, scripts) will be stored.
- Specify what ML-enabled system data needs to be collected (Telemetry involves collecting data such as clicks on particular buttons and could involve other usage and performance monitoring data).
- Identify the infrastructure requirements (e.g., hardware, software, data, etc.).
- Identify key interaction methods.
- Identify any security constraints or requirements (e.g., Is the system handling sensitive data?).
❓ Performance and Efficiency
- Identify any requirements related to performance – the ability of the system to perform actions within defined time and throughput bounds.
- Identify any hardware limitations that may require extra efficiency of the system.
- Identify the need for scalability – the ability to increase or decrease the capacity of the system in response to changing demands.
- Specify the acceptable time to execute the model and return the predictions.
- Specify the acceptable time to train the model if any.
❓ User Experience Perspective
- User Expectations – Specify expectations of customers and end-user in terms of how the system should behave (e.g., how often they expect predictions to be right or wrong).
- Accountability – Specify who is responsible for unexpected model results or actions taken based on unexpected model results.
- Cost Analysis – Evaluate execution costs and impact of incorrect predictions.
- User Guidance – Specify how strongly the system forces the user to do what the model indicates they should (e.g., automatic or assisted actions).
- Interaction Frequency – Specify how often the system interacts with users (e.g., interact whenever the user asks for it or whenever the system thinks the user will respond).
- User Involvement – Specify what interactions the users will have with the ML-enabled system, (e.g., to provide new data for learning, or human-in-the-loop systems where models require human interaction).
- Visualization – Specify methods for presenting ML outcomes comprehensibly.
❓ Bias, ethics, fairness
- Bias Mitigation: Outline strategies to prevent systematic prejudices in results.
- Ethical Considerations: Define requirements for ensuring moral behavior in the system.
- Fairness Metrics: Establish measures to ensure unbiased operation.
- Unbiased Dataset Selection: Define criteria for choosing unbiased training data.
❓ Stability
- Stability: Define requirements for consistent performance under varying conditions.
- Robustness: Specify system resilience to unexpected inputs or environmental changes.
- Flexibility: Outline system adaptability to changing demands or conditions.
❓ Privacy
- Specify requirements to prevent individual data identification.
❓ Legal and Regulatory Requirements
- Specify legal constraints on data use and storage.
- Compliance: Identify and document all relevant legal and regulatory requirements.
- Data Protection Laws: Specify adherence to regulations like GDPR.
- Industry-Specific Regulations: Outline compliance with sector-specific legal frameworks.
❓ Quantitative Targets
- Identify what quantitative metrics are required (e.g., the number of correctly predicted data points out of all data points).
- Determine the business impact of a false positive and of a false negative, and how the model should be tuned to maximize business results.
The dominant concerns … pertain to decision-making with the customers. In the conventional setting, this activity involved requirements analysis and specification in the initial phase and an acceptance inspection in the final phase. This activity flow is not possible when working with ML-based systems due to the impossibility of prior estimation or assurance of achievable accuracy.
❓ Explainability, Interpretability & Justifiability
- Specify the need to understand the reasons for the model's inferences. The model might need to be able to summarize the reasons for its decisions. Other related concerns, such as transparency and interpretability, may apply.
- Explainability: the extent to which the internal mechanics of an ML-enabled system can be explained in human terms.
- Interpretability: the extraction of relevant knowledge from a model concerning relationships either contained in the data or learned by the model.
- Justifiability: the ability to show that the output of an ML-enabled system is right or reasonable.
- Transparency: the extent to which a human user can infer why the system made a particular decision or produced a particular externally visible behaviour.
The motivation for versioning, provenance, logging and transparency is that a decision made by a model should be auditable. Auditing is a way of proving that a model is indeed fair, accountable, and transparent. A good criterion for a model being auditable is that it is possible to perform a root cause analysis for a given event, say an unexpected model decision. A root cause is a precursor event without which the event being investigated would not have occurred, or would not recur. Other causes may influence an event, but the root cause is the crucial first step in the chain without which the event cannot occur.
Explainability is twofold: on the one side, there is a need to explain the model (what has been learned); on the other side, there is a need to explain individual predictions of the model. One interviewed practitioner (P4) mentioned that explainability may be even more important than predictive power: "Often, we constrain the models to derive explanations. We look for models that partition the input attributes to show relations between input and output. This decreases the predictive power but is usually favored by our customers."
❓ Usability
- The extent to which a system can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.
❓ Safety
- The absence of failures or conditions that render a system dangerous.
1.2.3. 📖 Assumptions
List the assumptions made by the project. These may be assumptions about the data that can be checked during data mining, but may also include non-checkable assumptions about the business upon which the project rests. It is particularly important to list the latter if they form conditions on the validity of the results.
- Clarify all assumptions (including implicit ones) and make them explicit (e.g., To address the business question, a minimum number of customers with age above 50 is necessary).
- List assumptions on data quality (e.g., accuracy, availability).
- List assumptions on external factors (e.g., economic issues, competitive products, technical advances).
- Clarify assumptions that lead to any of the estimates (e.g., the price of a specific tool is assumed to be lower than $1000).
- List all assumptions on whether it is necessary to understand, describe or explain the model (e.g., how should the model and results be presented to senior management or the sponsor?).
1.2.4. 📖 Constraints
List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data that it is practical to use for modeling.
- Check general constraints (e.g., legal issues, budget, timescales and resources).
- Check access rights to data sources (e.g., access restrictions, password required).
- Check technical accessibility of data (operating systems, data management system, file or database format).
- Check whether relevant knowledge is accessible.
- Check budget constraints (Fixed costs, implementation costs, etc.).
1.2.5. 📖 Terminology
Compile a glossary of terminology relevant to the project. This may include two components:
- A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful “knowledge elicitation” and education exercise.
- A glossary of data mining terminology, illustrated with examples relevant to the business problem in question.
- Check prior availability of glossaries, otherwise begin to draft glossaries.
- Talk to domain experts to understand their terminology.
- Become familiar with the business terminology.
1.2.6. 📖 Costs and benefits
Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible (for example, using monetary measures in a commercial situation), as this enables a stronger business case to be made.
- Estimate costs for data collection.
- Estimate costs of developing and implementing a solution.
- Identify benefits when a solution is deployed (e.g. improved customer satisfaction, ROI and increase in revenue).
- Estimate operating costs.
Remember to identify hidden costs such as repeated data extraction and preparation, changes in work flows and training time during learning.
1.2.7. 📖 Risks and Contingencies
List the risks, that is, events that might occur and impact schedule, cost or results. List the corresponding contingency plans: what action will be taken to avoid or minimize the impact of, or to recover from, each foreseen risk.
- Identify business risks (e.g., competitor comes up with better results first).
- Identify organizational risks (e.g., department requesting project not having funding for project).
- Identify financial risks (e.g., further funding depends on initial data mining results).
- Identify technical risks.
- Identify risks that depend on data and data sources (e.g. poor quality and coverage).
- Determine conditions under which each risk may occur.
- Develop contingency plans.
1.3. Determine data mining goals
A business goal states objectives in business terminology. A data mining goal states project objectives in technical terms. For example, the business goal might be “Increase catalog sales to existing customers.” A data mining goal might be “Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.) and the price of the item.”
- Data mining goals: Describe the intended outputs of the project that enable the achievement of the business objectives.
- Data mining success criteria: Define the criteria for a successful outcome to the project in technical terms, for example a certain level of predictive accuracy or a propensity-to-purchase profile with a given degree of "lift." As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons making the subjective judgment should be identified. Specify the metrics and acceptable values the model should achieve (e.g., for classification problems this could involve accuracy ≥ X%, precision ≥ Y%, recall ≥ Z%). Specify the ML results in terms of the functionality the model will provide (e.g., classify customers, predict probabilities).
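A minimal sketch of turning such success criteria into an automated acceptance check (the metric names and thresholds below are illustrative placeholders, not values prescribed by this guide):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical thresholds agreed with the business (placeholders).
ACCEPTANCE_CRITERIA = {"accuracy": 0.90, "precision": 0.85, "recall": 0.80}

def meets_success_criteria(y_true, y_pred):
    """Check the agreed data mining success criteria on a held-out dataset."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
    passed = all(scores[name] >= threshold
                 for name, threshold in ACCEPTANCE_CRITERIA.items())
    return passed, scores
```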
1.4. Produce project plan
Describe the intended plan for achieving the data mining goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.
1.4.1. Maintainability
The ease with which a system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment.
1.4.2. Portability
The ability to transfer a system or element of a system from one environment to another.
1.4.3. Reliability
The probability of the software performing without failure for a specific number of uses or amount of time.
- Project plan: List the stages to be executed in the project, together with their duration, resources required, inputs, outputs and dependencies. Where possible, make explicit the large-scale iterations in the data mining process, for example repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between the time schedule and risks. Mark the results of these analyses explicitly in the project plan, ideally with actions and recommendations should the risks materialize. Note: the project plan contains detailed plans for each phase; for example, decide at this point which evaluation strategy will be used in the evaluation phase. The project plan is a dynamic document in the sense that at the end of each phase a review of progress and achievements is necessary, and a corresponding update of the project plan is recommended. Specific review points for these reviews are part of the project plan, too.
- Initial assessment of tools and techniques: At the end of the first phase, the project also performs an initial assessment of tools and techniques. Here, you select, for example, a data mining tool that supports various methods for the different stages of the process. It is important to assess tools and techniques early in the process, since this selection possibly influences the entire project.
- Compromises: In data-driven intelligent applications, the satisfaction of business goals is constrained by the limitations of technological solutions. Sometimes the business experts have to make compromises and accept a solution that falls short of expectations.
1.5. 👨🏫 Educating Stakeholders
1.5.1. 👨🏫 Focus on Augmenting People, Not Replacing Them
Big technological advances are often historically associated with a reduction in staff head count. While reducing labor costs is attractive to business executives, it is likely to create resistance from those whose jobs appear to be at risk. In pursuing this way of thinking, organizations can miss out on real opportunities to use the technology effectively. "We advise our clients that the most transformational benefits of AI in the near term will arise from using it to enable employees to pursue higher-value activities," added Mr. Andrews.
Gartner predicts that by 2020, 20 percent of organizations will dedicate workers to monitoring and guiding neural networks.
"Leave behind notions of vast teams of infinitely duplicable 'smart agents' able to execute tasks just like humans," said Mr. Andrews. "It will be far more productive to engage with workers on the front line. Get them excited and engaged with the idea that AI-powered decision support can enhance and elevate the work they do every day."
As you can notice, I use the term “augment” when referring to the job AI is to perform — that’s because AI’s primary task is to augment human work and support data-driven decision-making, not to replace humans in the workplace. Of course, there are businesses aiming at automating as much as can be automated, but generally speaking, it’s really not AI’s cup of tea. It’s much more into teamwork. What’s more, it has been found that AI and humans joining forces gives better results. In a Harvard Business Review article, authors H. James Wilson and Paul R. Daugherty write:
In our research involving 1,500 companies, we found that firms achieve the most significant performance improvements when humans and machines work together.
However, as a leader, your job in an AI project is to help your staff understand why you’re introducing artificial intelligence and how they should use the insights provided by the model. Without that, you just have fancy, but useless, analytics.
1.5.2. 👨🏫 Make data literacy an organization-wide priority, not just among people within the technology org.
Data literacy is not a technical skill. It is a professional skill. Encourage all of your employees — marketers, sales professionals, operations personnel, product managers, etc. — to develop their data literacy through quarterly engagement sessions that you host, where you cover topics like data-driven decision-making, the art of the possible in AI, how data connects to your business, ethics & AI, or how to communicate using data. This kind of organization-wide emphasis is the basis for a transformation to a data-first culture.
1.5.3. 👨🏫 Develop an internal common language for speaking about data, how it intersects with your business and industry, and how it is changing specific roles at your company.
The world of data is big, filled with buzzwords and misunderstanding. Develop a view as an organization which components of data literacy matter most to your organization — if you are a financial services firm, it may be probability and risk measurement; if you are a technology firm, it may be experimentation and visualization. In your L&D sessions, develop learning content that uses this language and demonstrates how it connects to your business in multiple departments, so employees can connect all the dots between data literacy and their workflows.
1.5.4. 👨🏫 No dataset pitfall
Machine learning is not about solving some random problem that looks commercially appealing. It is all about finding a problem for which a good training dataset can be acquired.
For example, what is harder: speech recognition or OCR? Both tasks look similar at first glance; both attempt to recognize a sequence of words. The only difference is that the first targets an audio stream, while the second targets a stream of glyphs. And yet speech recognition is much harder.
The trouble is that it is difficult to build a good dataset for speech recognition. The only way to do this is to hire an army of native speakers and ask them to manually transcribe the text in a collection of audio files. This process is extremely money- and time-consuming. It can take half a year to build a first dataset that is barely sufficient for training, and much longer to develop it into a high-quality, representative dataset that includes corner cases such as rare accents.
This problem is not so severe with OCR. A training dataset for OCR can be constructed automatically without relying on manual labor. The idea is to render sample text into an image file and then distort it so that it resembles real-world scanned text: rotate it, blur it, add noise, apply a brightness gradient, etc. Repeating this process with different texts and fonts allows creating a dataset of decent quality in a fully automatic mode. Of course, some actually scanned texts are required to make the dataset representative, but this is not a bottleneck for ML engineers and can be done later.
Machine learning problems can be roughly classified into the following tiers depending on the source of the dataset: naturally existing datasets; programmatically constructed datasets; manually created datasets.
Tasks with naturally existing datasets form the lowest tier. For instance, the problem of detecting the sex and age of a person by analyzing a photo is exactly of this type: a dataset can be constructed by scraping social network profiles. Another problem of this type is predicting the next word while a user is typing a text message, provided that an archive of previously sent messages exists. Similar problems are predicting the user's geolocation, stock price forecasting, and recommending movies. Even when data already exists, problems in this category can pose a significant challenge because the data can be dirty or biased. People do not always specify their true sex and age in profiles, and recommender systems are prone to bias due to the filter-bubble effect. Still, naturally existing dirty data is better than no data at all.
The second tier is formed by problems which do not have natural datasets, but for which datasets can be constructed automatically. One such problem, OCR, was described above. A large number of other problems focused on image enhancement can be handled in a similar manner, including image colorization, deblurring and upscaling. The dataset construction process consists of downloading a collection of originally good images and then artificially degrading them, e.g., converting them to black and white. The goal of the ML model is to reconstruct the original image. Datasets for other problems, such as audio denoising, can be constructed by mixing multiple samples.
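A minimal sketch of this kind of tier-2 dataset construction, assuming Pillow is available (the specific degradations are illustrative):

```python
from pathlib import Path
from PIL import Image, ImageFilter

def make_training_pair(src: Path, out_dir: Path) -> None:
    """Turn one 'good' image into a (degraded input, original target) pair."""
    original = Image.open(src).convert("RGB")
    degraded = original.convert("L")                             # drop color (colorization)
    degraded = degraded.filter(ImageFilter.GaussianBlur(2))      # blur (deblurring)
    w, h = original.size
    degraded = degraded.resize((w // 4, h // 4)).resize((w, h))  # lose detail (upscaling)
    degraded.save(out_dir / f"input_{src.name}")
    original.save(out_dir / f"target_{src.name}")
```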
The third and highest tier consists of problems which require manual construction of the dataset. Speech recognition, adult content detection, search engine ranking: all of these require data to be labeled manually. If a problem falls into this tier, it will take 2-10 times longer to solve than a "similar" problem from the other tiers. For example, the British National Corpus, which is widely used for training language models, was constructed over three years with input from multiple organizations; it is doubtful that a small company could commit such resources. Not only is manual labeling time-consuming, it also requires extra effort from the ML engineer, who has to assemble the training dataset piece by piece knowing that every data batch scheduled for labeling costs money. Every new batch must increase the overall representativeness of the dataset, or money will be wasted. Selecting such batches is significantly harder than filtering dirty data (tier 1) or creating additional data mutators (tier 2).
When approached with a new ML project, the first question to ask is: how hard will it be to create the dataset? People without a technical background often forget about data. In fact, more effort is spent on dataset construction than on fitting the ML model. Projects which have to rely on manual data labeling are particularly prone to incorrect expectations. Think twice before committing to such projects.
1.5.5. 👨🏫 No classification error allowed
When building a classifier, only one type of error can be minimized.
This is an example of typical conversation between a manager and a tech lead:
Manager: Hello, I have an idea that will earn us a fortune.
TechLead: What is it?
Manager: There are a lot of prospective customers who are willing to pay for a classifier that is able to mark images with explicit content. Forums, social networks, file hostings — actually all sorts of public web sites where people can upload photos. We will be the first to implement this classifier and will become rich.
TechLead: Nope, we won't be able to create such a classifier.
Manager: What are you talking about? I do not see any offensive images in Google image search. Surely they can do this, and so can we.
At this moment the tech lead has to explain the concepts of precision and recall to the manager. A classifier can be optimized for only one type of error. Either it will erroneously filter out some good images in addition to the bad ones, or it will allow some bad images to pass through. But it is impossible to build a classifier that is free of both errors. In other words, there is a knob in the ML model that allows choosing which error matters more. Depending on how you treat the "gray" zone, the error shifts to one side or the other: the lower one error type, the higher the other. If the knob is set to the middle position, nothing good happens, because both error types will be greater than zero.
In the hypothetical classifier discussed by the manager and the tech lead, neither error is acceptable. If the classifier allows bad images through, users will be afraid to visit the web site in public or to let children use it. And if it blocks the upload of innocent images, users will be frustrated, not understanding what they did wrong. Google, on the other hand, is not so constrained. It can turn the ML knob to a position where all bad images are filtered out along with some good ones. The web is huge: it will certainly be able to find other images that match the query and also pass the classifier. If the classifier even slightly complains about some image, that image is removed and replaced with another, absolutely innocent one.
Similarly, you may have noticed that a lot of signboards in Google Street View are blurred as if they were car license plates. The reason is the same: the ML knob is in a position where the model correctly detects all true license plates at the cost of misclassifying some ordinary signboards as plates.
When dealing with spam, the situation is the opposite. It is better to allow some spam messages through the antispam filter than to throw away important messages. Only when the classifier is 99.999% sure that a message is spam is it filtered out.
Another example of this kind is an ads engine. Suppose that a classifier is able to use location information to predict that a user's car has just broken down on the road. A positive classification instructs the ads engine to promote nearby service centers and dealerships to that user. Misclassifying a user and displaying irrelevant ads because they stopped on the road for some random reason is not a big problem, but it would be inexcusable to miss a click from a person who really needs such services. Even if the classifier is so bad that four out of five people are marked incorrectly, the ads should be displayed.
It should be noted that if neither error type is acceptable, it is not the end of the world yet. It just means that the classifier cannot work in fully automatic mode, but only as a prefilter; the final decision must be made by a real person. In the original example about explicit content detection, the classifier may be tuned to make one of the following decisions: a) publish the photo immediately if it is sure the photo is OK; b) immediately reject the photo and notify the user if it is sure the photo is not OK; c) add the photo to a queue for manual review in case of uncertainty. The trouble is that it may be prohibitively expensive to pay reviewers. However, if you manage to get people to work for free (usually such people are proudly called "moderators"), then it can work out.
Conclusion: when building classifiers, an important question to ask is: assuming the cumulative error will be 20%, which way would you prefer to turn the knob?
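A minimal sketch of that knob, assuming the classifier outputs a probability of "bad" content (the threshold values are illustrative):

```python
import numpy as np

def flag_as_bad(probabilities: np.ndarray, threshold: float) -> np.ndarray:
    """Flag an image only if the model is at least `threshold` confident.

    Lowering the threshold blocks more bad images but also more innocent ones
    (search-engine setting); raising it lets more bad images through but rarely
    bothers legitimate users (spam-filter setting).
    """
    return probabilities >= threshold

probs = np.array([0.05, 0.40, 0.55, 0.97])
print(flag_as_bad(probs, threshold=0.50))   # middle position: both error types occur
print(flag_as_bad(probs, threshold=0.99))   # only near-certain cases are blocked
```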
1.5.6. 👨🏫 Time limits restrict model quality
Anyone even slightly acquainted with machine learning knows that training is a slow process, particularly in the case of neural networks. But not everybody realizes that training time is only a small nuisance. The real problem is not the training time; it is the running time of the already trained model.
For example, imagine a neural network that performs the currently popular face morphing. If it is applied to a still photo, there are no hard time limits: users will patiently wait a couple of seconds while their face is being morphed, provided that in return they get a nice visual effect. However, if the same model is to be applied to a real-time video stream (e.g., you want to morph your face during an online video chat), the time constraints create a challenge. Smartphones produce video streams of 30 frames per second, meaning the model must process every single frame in under 33 milliseconds. This limit is too strict for a model performing high-quality morphing. As such, either the model must be simplified, or the video resolution must be lowered, or the frame rate must be reduced. Any such simplification reduces output quality.
Issues with performance are not unique to neural networks; old-school machine learning algorithms may also be affected. For example, web search engines traditionally use ensembles of decision trees to find the top N relevant documents to display. These models are computationally much cheaper than neural networks, but the trouble is that to build a single result page, the model must be applied to thousands of candidate documents; only the top N highest-ranked candidates are displayed. To satisfy the time limit, developers have to use various heuristics, the most drastic one being to drop a large portion of documents from ranking altogether. This is one of the reasons why search engines may display irrelevant results: relevant documents are actually present in the search engine database, but they were skipped by the ranking engine because of the time limit.
Another commonly forgotten aspect is the time required to preprocess "raw" features into "model" features that can be fed to a model. Raw features (e.g., the pixels of the original 3264x2448 photo; or age, sex and salary history in a credit scoring computation) are rarely fed directly into the model. They are typically preprocessed into higher-level features. Different ML algorithms impose different restrictions and thus require different preprocessing steps to make the model work correctly. In the case of photo analysis, preprocessing may include converting the photo to black-and-white and downscaling it to fixed dimensions of 128x128.
These 16384 pixels are then used directly as the input features of the neural network. In the case of credit scoring, preprocessing may consist of generating feature combinations (Age × max(Age, 30) × log(CurrentSalary)), computing statistics ("median monthly credit payment over the last 15 years") and more computationally intensive steps like Principal Component Analysis. This preprocessing takes time, which may even exceed the time it takes to apply the model. Feature preprocessing is generally more expensive in old-school ML models than in neural networks; the latter impose fewer restrictions, which is one of the reasons why NNs have become popular recently. In all cases the preprocessing step does exist and takes time. The bad news is that this time is not included in the reports generated by model training frameworks. Training frameworks work directly with "model" features, which are typically prepared beforehand and supplied in a CSV file. When training is over, frameworks produce a report that includes, among other things, the time it takes to apply the model to a single sample: "test collection: 1200 samples; score 96.4%; 4 ms/sample". This timing may be misleading since it does not include preprocessing time, which is out of the scope of model training. It is therefore strongly recommended to run a separate test that measures the timing of the end-to-end workflow. Summarizing the above, the overall time to compute a single "logical" result is determined by the factors listed below; every component may dominate the others.
The following factors contribute to the running time:
- complexity of preprocessing the raw features;
- number of features fed into the model;
- complexity of the model itself (number of layers and connections in a NN; number of trees and their depth in decision tree models, etc.);
- number of samples that must be fed into the model to produce a single "logical" result;
- hardware the model will run on.
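A minimal sketch of the end-to-end timing decomposition these factors imply (a reconstruction for illustration, not a formula quoted from any framework report):

```python
def time_per_logical_result(n_samples: int,
                            t_preprocess_per_sample: float,
                            t_apply_model_per_sample: float) -> float:
    """Rough end-to-end time (seconds) to produce one 'logical' result.

    n_samples: how many samples must be scored for one result
               (e.g., thousands of candidate documents per search page).
    t_preprocess_per_sample: raw-feature preprocessing time per sample.
    t_apply_model_per_sample: model application time per sample, which
                              depends on model complexity and hardware.
    Any of these terms can dominate the others.
    """
    return n_samples * (t_preprocess_per_sample + t_apply_model_per_sample)
```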
There is a huge difference between running a model on a GTX 1080 video card, a general-purpose Intel/AMD CPU, and a smartphone ARM CPU. If you have tight time limits, then a lot of things have to be simplified, thereby reducing model quality. Be prepared for that.
2. Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.
Machine learning is not about solving some random problem that looks commercially appealing. It is all about finding a problem for which a good training dataset can be acquired.
2.1. Collect data
Acquire within the project the data (or access to the data) listed in the project resources.
2.1.1. Data collection report
List the dataset (or datasets) acquired, together with their locations within the project, the methods used to acquire them and any problems encountered. Record problems encountered and any solutions achieved to aid with future replication of this project or with the execution of similar future projects.
2.1.2. Data Acquisition
Automate data acquisition and any processes that were necessary to ingest the data.
2.2. Describe data
Examine the “gross” or “surface” properties of the acquired data and report on the results.
2.2.1. Data description report
Describe the data which has been acquired, including: the format of the data, the quantity of data, for example number of records and fields in each table, the identities of the fields and any other surface features of the data which have been discovered. Does the data acquired satisfy the relevant requirements?
2.3. Data exploration
This task tackles the data mining questions, which can be addressed using querying, visualization and reporting. These include: distribution of key attributes, for example the target attribute of a prediction task; relations between pairs or small numbers of attributes; results of simple aggregations; properties of significant sub-populations; simple statistical analyses. These analyses may address directly the data mining goals; they may also contribute to or refine the data description and quality reports and feed into the transformation and other data preparation needed for further analysis.
Describe results of this task including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate, include graphs and plots, which indicate data characteristics or lead to interesting data subsets for further examination.
- Report tables and their relations.
- Check data volume, number of multiples, complexity.
- Check attribute types (numeric, symbolic, taxonomy etc.).
- Check attribute value ranges.
- Analyze attribute correlations.
- Understand the meaning of each attribute and attribute value in business terms.
- For each attribute, compute basic statistics (e.g., distribution, average, max, min, standard deviation, variance, mode, skewness, etc.); a minimal sketch follows this list.
- Analyze basic statistics and relate the results to their meaning in business terms.
- Is the attribute relevant for the specific data mining goal?
- Is the attribute meaning used consistently?
- Interview a domain expert for their opinion on attribute relevance.
- Is it necessary to balance the data (Depending on the modeling technique used)?
- Analyze key relations.
- Check amount of overlaps of key attribute values across tables.
- Review assumptions/goals.
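A minimal sketch of computing those per-attribute statistics with pandas (the file name and columns are placeholders):

```python
import pandas as pd

df = pd.read_csv("customers.csv")        # placeholder data source

# Count, mean, std, min, max and quartiles for every attribute.
print(df.describe(include="all"))

# Higher moments and simple pairwise relations for numeric attributes.
print(df.skew(numeric_only=True))
print(df.corr(numeric_only=True))
```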
2.3.1. Verify data quality
Examine the quality of the data, addressing questions such as: is the data complete (does it cover all the cases required)? Is it correct or does it contain errors and if there are errors how common are they? Are there missing values in the data? If so how are they represented, where do they occur and how common are they?
List the results of the data quality verification; if quality problems exist, list possible solutions. Solutions to data quality problems generally depend heavily on both data and business knowledge.
"There are many dimensions of data quality. [...] For me, the most important ones are completeness, consistency, and correctness." Completeness refers to the sparsity of data within each characteristic (i.e., does the data cover the whole range of possible values?). Consistency refers to the format and representation of the data, which should be the same across the dataset. Correctness refers to the degree to which you can rely on the data actually being true. Correctness is strongly influenced by the way the data was collected.
By analogy, training data needs testing like code, and a trained ML model needs production practices like a binary does, such as debuggability, rollbacks and monitoring. The ISO/IEC standard 25012 [30] describes characteristics of data quality. Interestingly, this standard is not as widely used in RE as its sibling ISO/IEC 25010 [26].
2.3.2. Schema Validation
It is useful to encode intuitions about the data in a schema so they can be automatically checked. For example, an adult human is surely between one and ten feet in height. The most common word in English text is probably 'the', with other word frequencies following a power-law distribution.
To construct the schema, one approach is to start with calculating statistics from training data, and then adjusting them as appropriate based on domain knowledge. It may also be useful to start by writing down expectations and then compare them to the data to avoid an anchoring bias. Visualization tools such as Facets can be very useful for analyzing the data to produce the schema.
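A minimal sketch of such a schema encoded as an executable check (the columns, dtypes and ranges are placeholders; dedicated tools such as TensorFlow Data Validation or Great Expectations serve the same purpose at scale):

```python
import pandas as pd

# Hypothetical schema: expected dtype and plausible value range per column.
SCHEMA = {
    "height_ft": {"dtype": "float64", "min": 1.0, "max": 10.0},
    "age_years": {"dtype": "int64", "min": 0, "max": 130},
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return human-readable schema violations found in the dataframe."""
    problems = []
    for column, rules in SCHEMA.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        if str(df[column].dtype) != rules["dtype"]:
            problems.append(f"{column}: dtype {df[column].dtype}, expected {rules['dtype']}")
        out_of_range = df[(df[column] < rules["min"]) | (df[column] > rules["max"])]
        if len(out_of_range):
            problems.append(f"{column}: {len(out_of_range)} values outside "
                            f"[{rules['min']}, {rules['max']}]")
    return problems
```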
3. Modeling
3.1. Select modeling technique
As the first step in modeling, select the actual modeling technique that is to be used. Although you may already have selected a tool during business understanding, this task refers to the specific modeling technique, e.g., decision tree building with C4.5 or neural network generation with back propagation. If multiple techniques are applied, perform this task separately for each technique.
Modeling technique. Document the actual modeling technique that is to be used.
Modeling assumptions. Many modeling techniques make specific assumptions on the data, e.g., all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic etc. Record any such assumptions made.
3.2. Generate test design
Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the dataset into train and test set, build the model on the train set and estimate its quality on the separate test set.
Test design. Describe the intended plan for training, testing and evaluating the models. A primary component of the plan is to decide how to divide the available dataset into training data, test data and validation datasets.
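A minimal sketch of one common split strategy (the proportions and the stand-in dataset are illustrative; time-based or grouped splits may be more appropriate for some problems):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)   # stand-in data

# 60% train, 20% validation, 20% test, stratified by class.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)
```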
3.3. Assess model
The data mining engineer interprets the models according to their domain knowledge, the data mining success criteria and the desired test design. This task overlaps with the subsequent evaluation phase: whereas here the data mining engineer judges the success of the application of modeling and discovery techniques in technical terms, they contact business analysts and domain experts later in order to discuss the data mining results in the business context. Moreover, this task only considers models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project.
3.4. Using version control
Every model specification undergoes a code review and is checked in to a repository: It can be tempting to avoid code review out of expediency, and run experiments based on one's own personal modifications. In addition, when responding to production incidents, it's crucial to know the exact code that was run to produce a given learned model. For example, a responder might need to re-run training with corrected input data, or compare the result of a particular modification. Proper version control of the model specification can help make training auditable and improve reproducibility.
3.5. Training is reproducible
Ideally, training twice on the same data should produce two identical models. Deterministic training dramatically simplifies reasoning about the whole system and can aid auditability and debugging. For example, optimizing feature generation code is a delicate process but verifying that the old and new feature generation code will train to an identical model can provide more confidence that the refactoring was correct. This sort of diff-testing relies entirely on deterministic training. Unfortunately, model training is often not reproducible in practice, especially when working with non-convex methods such as deep learning or even random forests. This can manifest as a change in aggregate metrics across an entire dataset, or, even if the aggregate performance appears the same from run to run, as changes on individual examples. Random number generation is an obvious source of nondeterminism, which can be alleviated with seeding. But even with proper seeding, initialization order can be underspecified so that different portions of the model will be initialized at different times on different runs leading to non-determinism. Furthermore, even when initialization is fully deterministic, multiple threads of execution on a single machine or across a distributed system may be subject to unpredictable orderings of training data, which is another source of non-determinism.
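A minimal sketch of pinning the obvious sources of randomness (framework-specific settings, such as deterministic GPU kernels or data-loading order, are not shown and may still be required):

```python
import os
import random

import numpy as np

SEED = 42

os.environ["PYTHONHASHSEED"] = str(SEED)   # hash-based orderings
random.seed(SEED)                          # Python's built-in RNG
np.random.seed(SEED)                       # NumPy's global RNG
# If a deep learning framework is used, seed it too, e.g.
# torch.manual_seed(SEED) or tf.random.set_seed(SEED).
```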
3.6. Monitor for numeric stability
The model is numerically stable: Invalid or implausible numeric values can potentially crop up during model training without triggering explicit errors, and knowing that they have occurred can speed diagnosis of the problem.
How? Explicitly monitor the initial occurrence of any NaNs or infinities. Set plausible bounds for weights and the fraction of ReLU units in a layer returning zero values, and trigger alerts during training if these exceed appropriate thresholds.
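A minimal sketch of such a monitor, assuming the weights can be inspected as NumPy arrays (the bound is a placeholder):

```python
import numpy as np

def numeric_stability_alerts(weights: dict[str, np.ndarray],
                             max_abs_weight: float = 1e3) -> list[str]:
    """Report NaNs, infinities, or implausibly large values in model weights."""
    alerts = []
    for name, w in weights.items():
        if np.isnan(w).any():
            alerts.append(f"{name}: contains NaN")
        if np.isinf(w).any():
            alerts.append(f"{name}: contains Inf")
        if np.abs(w).max() > max_abs_weight:
            alerts.append(f"{name}: |weight| exceeds {max_abs_weight}")
    return alerts
```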
3.7. Prepare micro training set for testing
Apart from having a training / test dataset split, it is extremely useful to have a tiny set of data that allows quicker model iteration and testing.
3.8. Pipeline allows debugging
The model should allow debugging by observing the step-by-step computation of training or inference on a single example: When someone finds a case where a model is behaving bizarrely, how difficult is it to figure out why? Is there an easy, well-documented process for feeding a single example to the model and investigating the computation through each stage of the model? This is especially important when the model is deployed in production and a user reports a bug. If the model is not easily debuggable, it can be difficult to figure out what went wrong, and even harder to fix it.
3.9. Test model specification code
Model specification code is unit tested: Although model specifications may seem like “configuration”, such files can have bugs and need to be tested. Unfortunately, testing a model specification can be very hard. Unit tests should run quickly and require no external dependencies but model training is often a very slow process that involves pulling in lots of data from many sources.
How? It is useful to distinguish two kinds of model tests: tests of API usage and tests of algorithmic correctness.
ML APIs can be complex, and code using them can be wrong in subtle ways. Even if code errors would be apparent after training (due to a model that fails to train or results in poor performance), training is expensive and so the development loop is slow. We have found in practice that a simple unit test to generate random input data, and train the model for a single step of gradient descent is quite powerful for detecting a host of common library mistakes, resulting in a much faster development cycle. Another useful assertion is that a model can restore from a checkpoint after a mid-training job crash.
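A minimal sketch of such a test, using scikit-learn's partial_fit as a stand-in for a single optimization step (the model, shapes and labels are placeholders):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def test_model_trains_one_step_on_random_data():
    """Catch API misuse cheaply: random inputs, one optimization step."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 10))               # placeholder feature shape
    y = rng.integers(0, 2, size=32)             # placeholder binary labels
    model = SGDClassifier()
    model.partial_fit(X, y, classes=np.array([0, 1]))   # one step, no full training
    assert model.predict(X).shape == (32,)
```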
Testing the correctness of a novel implementation of an ML algorithm is more difficult, but still necessary: it is not sufficient that the code produces a model with high-quality predictions; it must do so for the expected reasons. One solution is to assert that specific subcomputations of the algorithm are correct, e.g., that a specific part of an RNN was executed exactly once per element of the input sequence. Another is not to train to completion in the unit test but only for a few iterations, verifying that the loss decreases with training. Still another is to purposefully train a model to overfit: if one can get a model to effectively memorize its training data, that provides some confidence that learning reliably happens. When testing models, take pains to avoid "golden tests", i.e., tests that partially train a model and compare the results to a previously generated model; such tests are difficult to maintain over time without blindly updating the golden file. In addition to problems with training non-determinism, when these tests do break they provide very little insight into how or why. Additionally, flaky tests remain a real danger here.
3.10. Test model against a baseline
A simpler model is not better: Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.
3.11. Test features for predictive power
All features are beneficial: A kitchen-sink approach to features can be tempting, but every feature added has a software engineering cost. Hence, it's important to understand the value each feature provides in additional predictive power (independent of other features). This is particularly useful for explainability – models with fewer features are easier to understand for humans.
How? Some ways to run this test are by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
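A minimal sketch of the "remove one feature at a time" variant (the estimator and scoring are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def feature_value(X: np.ndarray, y: np.ndarray) -> dict[int, float]:
    """Score drop when each feature is removed; a near-zero drop suggests low value."""
    baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    drops = {}
    for i in range(X.shape[1]):
        X_without = np.delete(X, i, axis=1)
        score = cross_val_score(LogisticRegression(max_iter=1000), X_without, y, cv=5).mean()
        drops[i] = baseline - score
    return drops
```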
3.12. Test model on data slices
Model quality is sufficient on all important data slices: Slicing a data set along certain dimensions of interest can improve fine-grained understanding of model quality. Slices should distinguish subsets of the data that might behave qualitatively differently, for example, users by country, users by frequency of use, or movies by genre.
Examining sliced data avoids having fine-grained quality issues masked by a global summary metric, e.g., global accuracy improved by 1% but accuracy for one country dropped by 50%. This class of problems often arises from a fault in the collection of training data that caused an important set of training data to be lost or late.
How? Consider including these tests in your release process. For example, release tests for models can impose absolute thresholds (e.g., error for slice x must be less than 5%) to catch large drops in quality, as well as incremental thresholds (e.g., the change in error for slice x must be less than 1% compared to the previously released model).
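A minimal sketch of a per-slice release check (the slicing column, metric and threshold are placeholders):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def slice_report(df: pd.DataFrame, slice_col: str = "country",
                 min_accuracy: float = 0.95) -> pd.DataFrame:
    """Per-slice accuracy with a pass/fail flag against an absolute threshold.

    Assumes the dataframe has 'label' and 'prediction' columns (placeholder names).
    """
    rows = []
    for value, group in df.groupby(slice_col):
        acc = accuracy_score(group["label"], group["prediction"])
        rows.append({"slice": value, "n": len(group),
                     "accuracy": acc, "passes": acc >= min_accuracy})
    return pd.DataFrame(rows)
```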
3.13. Enforce privacy and security constraints
Features adhere to meta-level requirements: Your project may impose requirements on the data coming in to the system. It might prohibit features derived from user data, prohibit the use of specific features like age, or simply prohibit any feature that is deprecated. It might require all features be available from a single source. However, during model development and experimentation, it is typical to try out a wide variety of potential features to improve prediction quality.
How? Programmatically enforce these requirements, so that all models in production properly adhere to them.
The data pipeline has appropriate privacy controls: Training data, validation data, and vocabulary files all have the potential to contain sensitive user data. While teams often are aware of the need to remove personally identifiable information (PII), during this type of exporting and transformations, programming errors and system changes can lead to inadvertent PII leakages that may have serious consequences.
How? Make sure to budget sufficient time during new feature development that depends on sensitive data to allow for proper handling. Test that access to pipeline data is controlled as tightly as the access to raw user data, especially for data sources that haven't previously been used in ML. Finally, test that any user-requested data deletion propagates to the data in the ML training pipeline, and to any learned models.
3.14. Test feature code
All input feature code is tested: Feature creation code may appear simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital. Bugs in features may be almost impossible to detect once they have entered the data generation process, especially if they are represented in both training and test data.
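A minimal sketch of unit-testing feature code (the feature itself is a hypothetical example):

```python
def days_since_last_purchase(last_purchase_ts: int, now_ts: int) -> int:
    """Hypothetical feature: whole days between the last purchase and 'now'."""
    return max(0, (now_ts - last_purchase_ts) // 86_400)

def test_days_since_last_purchase():
    assert days_since_last_purchase(0, 86_400) == 1
    assert days_since_last_purchase(0, 0) == 0
    # Clock skew must never produce negative feature values.
    assert days_since_last_purchase(86_400, 0) == 0
```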
3.15. Test for training-serving skew
Training-serving skew is much more related to the preprocessing step than to the model or the modeling itself. Training data is typically sourced from batch files, from databases directly, or from some kind of storage that releases the data in batches; the first thing after sourcing is to preprocess this data. Now think of where prediction data comes from. Models have to make a prediction on almost every incoming request, which is a totally different situation from training: prediction data is sourced from streaming pipelines. But that is not the problem. The problem is how we process this data. More often than not, preprocessing of prediction data from streaming pipelines is done in an ad-hoc manner with many shortcuts.
Needless to say, the best way to solve this is to ensure that batch and streaming data are processed in the same manner using the same pipeline. Cleaning, transformation and all related tasks should be the same for training and serving data. In fact, this is so crucial that there are dedicated architectures to ensure it: the Lambda architecture and the Kappa architecture. Consider both of these architectures before deploying and select the one that suits your needs.
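A minimal sketch of the underlying idea: a single preprocessing function imported by both the batch training job and the streaming prediction service (the feature logic and field names are placeholders):

```python
# features.py -- the single source of truth for preprocessing.

def preprocess(record: dict) -> list[float]:
    """Turn one raw record into model features; used for training AND serving."""
    return [
        float(record.get("age", 0)) / 100.0,
        1.0 if record.get("country") == "DE" else 0.0,
    ]

# Batch training job:        X = [preprocess(r) for r in training_records]
# Streaming prediction path: features = preprocess(incoming_request)
```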
3.16. Tune all hyperparameters
All hyperparameters have been tuned: An ML model can often have multiple hyperparameters, such as learning rates, number of layers, layer sizes and regularization coefficients. The choice of hyperparameter values can have a dramatic impact on prediction quality.
How? Methods such as a grid search or a more sophisticated hyperparameter search strategy not only improve prediction quality, but also can uncover hidden reliability issues. Substantial performance improvements have been realized in many ML systems through use of an internal hyperparameter tuning service.
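For example, a basic grid search with scikit-learn looks like the sketch below; the estimator, parameter grid, and synthetic data are illustrative assumptions, and a tuning service or Bayesian search would follow the same pattern of search space, objective, and best trial:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],   # regularization strength
    "penalty": ["l2"],
}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="neg_log_loss")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```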
3.17. Test for fairness
The model has been tested for considerations of inclusion: There have been a number of recent studies on the issue of ML fairness [14], [15], which may arise inadvertently due to factors such as choice of training data. For example, Bolukbasi et al. found that a word embedding trained on news articles had learned some striking associations between gender and occupation that may have reflected the content of the news articles but which may have been inappropriate for use in a predictive modeling context. Such potentially overlooked biases in training data sets may then influence larger system behavior. How? Diagnosing such issues is an important step in creating robust modeling systems that serve all users well. Tests that can be run include examining input features to determine whether they correlate strongly with protected user categories, and slicing predictions to determine whether prediction outputs differ materially when conditioned on different user groups.
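The two tests mentioned above could start as simply as the following sketch, assuming a pandas DataFrame of logged predictions with hypothetical column names ("group", "prediction") and a candidate feature to probe:

```python
import pandas as pd

def prediction_rate_by_group(df: pd.DataFrame) -> pd.Series:
    """Slice positive-prediction rates by a protected attribute."""
    return df.groupby("group")["prediction"].mean()

def feature_group_correlation(df: pd.DataFrame, feature: str) -> float:
    """Crude proxy check: how strongly does a feature track group membership?"""
    return df[feature].corr(df["group"].astype("category").cat.codes)

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "prediction": [1, 1, 0, 1],
    "zipcode_income": [80, 85, 30, 35],
})
print(prediction_rate_by_group(df))                       # a: 1.0 vs. b: 0.5 warrants a closer look
print(feature_group_correlation(df, "zipcode_income"))    # near -1.0: strong proxy for group
```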
3.18. Plan for data drift and model staleness
The impact of model staleness is known: Many production ML systems encounter rapidly changing, non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, we say the model is stale. Understanding how model staleness affects the quality of predictions is necessary to determine how frequently to update the model. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? Most models need to be updated eventually to account for changes in the external world; a careful assessment is important to decide how often to perform the updates.
How? One way of testing the impact of staleness is with a small A/B experiment with older models. Testing a range of ages can provide an age-versus-quality curve to help understand what amount of staleness is tolerable.
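A rough sketch of producing that curve, assuming archived model snapshots and a labeled sample of today's traffic; the snapshot list and the AUC metric are placeholders:

```python
from sklearn.metrics import roc_auc_score

def staleness_curve(model_snapshots, X_today, y_today):
    """model_snapshots: list of (age_in_days, fitted_model) pairs."""
    curve = []
    for age_days, model in model_snapshots:
        scores = model.predict_proba(X_today)[:, 1]
        curve.append((age_days, roc_auc_score(y_today, scores)))
    return sorted(curve)

# e.g. staleness_curve([(1, m_yesterday), (7, m_last_week), (365, m_last_year)], X_today, y_today)
# A steep drop between 7 and 365 days of age shows how much staleness is tolerable,
# and hence how often the model needs to be retrained and redeployed.
```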
4. Initial Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Operationalizing model evaluation requires an active organizational effort. Popular model evaluation “best practices” do not do justice to the rigor with which organizations think about deployments: they generally focus on a single, typically static held-out dataset to evaluate the model on [38] and a single choice of ML metric (e.g., precision, recall) [1, 2]. We find that MLEs invest significant resources in maintaining multiple up-to-date evaluation datasets and metrics over time, especially ensuring that data sub-populations of interest are adequately covered.
Validation. Since errors become more expensive to handle when users see them, it's good to test changes, prune bad ideas, and proactively monitor pipelines for bugs as early as possible (P1, P2, P5, P6, P7, P10, P14, P15, P18). P1 said: “The general theme, as we moved up in maturity, is: how do you do more of the validation earlier, so the iteration cycle is faster?”
4.1. Monitoring
Due to the dependency between the behavior of an ML system and the data it has been trained on, it is crucial to define actions that ensure that training data actually corresponds to real data. Since data characteristics in reality may change over time, requirements validation becomes an activity that needs to be performed continuously during system operation. Our interviewees agreed that monitoring and analysis of runtime data is essential for maintaining the performance of the ML system. They also agreed that ML systems need to be retrained regularly to adjust to recent data. By analyzing the problem domain, a requirements engineer should specify when and how often retraining is necessary. A requirements engineer should also specify conditions for data anomalies that may potentially lead to unreasonable behavior of the ML system during runtime.
4.2. Evaluate results
Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and seeks to determine whether there is some business reason why the model is deficient. Another option is to test the model(s) in trial applications within the real environment, if time and budget constraints permit.
Moreover, evaluation also assesses the other data mining results generated. Data mining results cover both models that are directly related to the original business objectives and other findings that are not necessarily related to those objectives but may unveil additional challenges, information, or hints for future directions.
Assess the data mining results with respect to the business success criteria. Summarize the assessment in terms of those criteria, including a final statement on whether the project already meets the initial business objectives.
5. Deployment
Creation of the model is generally not the end of the project. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. In many cases it is the customer, not the data analyst, who carries out the deployment steps.
5.1. Deployment plan
In order to deploy the data mining result(s) into the business, this task takes the evaluation results and derives a deployment strategy. If a general procedure has been identified for creating the relevant model(s), this procedure is documented here for later deployment. Summarize the deployment strategy, including the necessary steps and how to perform them.
5.2. Verify model quality prior to deployment
Model quality must be validated before attempting to serve it: After a model is trained but before it actually affects real traffic, an automated system needs to inspect it and verify that its quality is sufficient; that system must either bless the model or veto it, blocking its entry into the production environment.
How? It is important to test for both slow degradations in quality over many versions as well as sudden drops in a new version. For the former, setting loose thresholds and comparing against predictions on a validation set can be useful; for the latter, it is useful to compare predictions to the previous version of the model while setting tighter thresholds.
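A minimal sketch of such a bless-or-veto gate, assuming a single validation metric (AUC here) for the candidate and for the previously released model; the thresholds are illustrative assumptions:

```python
SLOW_DEGRADATION_FLOOR = 0.75   # loose absolute floor, catches erosion over many versions
SUDDEN_DROP_LIMIT = 0.01        # tight allowed drop versus the previous model

def bless_model(candidate_auc: float, previous_auc: float) -> bool:
    if candidate_auc < SLOW_DEGRADATION_FLOOR:
        return False                              # veto: quality eroded below the floor
    if previous_auc - candidate_auc > SUDDEN_DROP_LIMIT:
        return False                              # veto: sudden drop versus the last release
    return True                                   # bless: allow entry to production

print(bless_model(candidate_auc=0.81, previous_auc=0.82))   # True
print(bless_model(candidate_auc=0.78, previous_auc=0.82))   # False (sudden drop)
```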
5.3. Integration test the entire pipeline
The full ML pipeline is integration tested: A complete ML pipeline typically consists of assembling training data, feature generation, model training, model verification, and deployment to a serving system. Although a single engineering team may be focused on a small part of the process, each stage can introduce errors that may affect subsequent stages, possibly even several stages away. That means there must be a fully automated test that runs regularly and exercises the entire pipeline, validating that data and code can successfully move through each stage and that the resulting model performs well.
How? The integration test should run both continuously and with each new release of models or servers, in order to catch problems well before they reach production. Faster-running integration tests with a subset of training data or a simpler model can give developers quicker feedback, while still being backed by less frequent, long-running versions with a setup that more closely mirrors production.
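As a rough pytest-style illustration, the fast variant can exercise every stage on a tiny synthetic dataset so it runs in seconds; the stages below are stand-ins for your own assemble/featurize/train/verify/deploy code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def assemble_training_data(n=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)
    return X, y

def generate_features(X):
    return np.hstack([X, X[:, :1] ** 2])          # stand-in feature generation stage

def test_pipeline_end_to_end_small():
    X_raw, y = assemble_training_data()
    X = generate_features(X_raw)
    model = LogisticRegression().fit(X, y)        # stand-in training stage
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    assert auc > 0.7, "verification stage: model should clearly beat random"
    assert model.predict(X[:1]).shape == (1,)     # stand-in serving/deployment check
```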
5.4. Canary testing of the models
Models are tested via a canary process before they enter production serving environments: Offline testing, however extensive, cannot by itself guarantee the model will perform well in live production settings, as the real world often contains significant non-stationarity or other issues that limit the utility of historical data. Consequently, there is always some risk when turning on a new model in production.
One recurring problem that canarying can help catch is a mismatch between model artifacts and serving infrastructure. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. For example, a refactoring in the core learning library might change the low-level implementation of an operation Op in the model from Op0.1 to a more efficient implementation, Op0.2. A newly trained model will thus expect to be implemented with Op0.2; an older deployed server will not include Op0.2 and so will refuse to load the model.
How? To mitigate the mismatch issue, one approach is testing that a model successfully loads into production serving binaries and that inference on production input data succeeds. To mitigate the new-model risk more generally, one can turn up new models gradually, running old and new models concurrently, with new models only seeing a small fraction of traffic, gradually increased as the new model is observed to behave sanely.
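A sketch of that gradual turn-up, assuming a traffic router that can set the fraction of requests served by the new model and a health check over live metrics (both interfaces are hypothetical):

```python
import time

RAMP_STEPS = [0.01, 0.05, 0.20, 0.50, 1.00]

def canary_rollout(router, new_model_id, healthy, soak_seconds=3600):
    # First verify that the production serving binary can load the model and run inference.
    if not router.smoke_test(new_model_id):
        return "aborted: model failed to load or serve in the production binary"
    for fraction in RAMP_STEPS:
        router.set_traffic_fraction(new_model_id, fraction)
        time.sleep(soak_seconds)                              # let live metrics accumulate
        if not healthy(new_model_id):
            router.set_traffic_fraction(new_model_id, 0.0)    # pull the canary back
            return f"aborted at {fraction:.0%}: live metrics degraded"
    return "fully rolled out"
```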
5.5. Have a rollback system
Models can be quickly and safely rolled back to a previous serving version: A model “rollback” procedure is a key part of incident response to many of the issues that can be detected by the monitoring discussed in Section 5.6. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system. Because rolling back is an emergency procedure, operators should practice it regularly, under normal, non-emergency conditions.
5.6. Plan monitoring and maintenance
Monitoring and maintenance are important issues if the data mining result becomes part of the day-to-day business and its environment. A careful preparation of a maintenance strategy helps to avoid unnecessarily long periods of incorrect usage of data mining results. In order to monitor the deployment of the data mining result(s), the project needs a detailed plan on the monitoring process. This plan takes into account the specific type of deployment.
It is crucial to know not just that your ML system worked correctly at launch, but that it continues to work correctly over time. An ML system is by definition making predictions on previously unseen data, and it typically also incorporates new data into training over time. The standard approach is to monitor the system, i.e. to maintain a constantly updated “dashboard” displaying relevant graphs and statistics, and to automatically alert the engineering team when particular metrics deviate significantly from expectations. For ML systems, it is important to monitor serving systems, training pipelines, and input data. The subsections below recommend specific metrics to monitor throughout the system. The usual incident response approaches apply; one response unique to ML is to roll back not the system code but the learned model, hence the rollback test above (Section 5.5) to regularly ensure that this process is safe and easy.
5.6.1. ML pipeline monitoring and response
Monitoring ML pipelines and responding to bugs involves tracking live metrics (via queries or dashboards), slicing and dicing sub-populations to investigate prediction quality, patching the model with non-ML heuristics for known failure modes, and finding in-the-wild failures and adding them to the evaluation set.
5.6.2. Collect and validate against real KPIs
Offline proxy metrics correlate with actual online impact metrics: A user-facing production system's impact is judged by metrics of engagement, user happiness, revenue, and so forth. A machine learning system is trained to optimize loss metrics such as log-loss or squared error. A strong understanding of the relationship between these offline proxy metrics and the actual impact metrics is needed to ensure that a better scoring model will result in a better production system.
How? The offline/online metric relationship can be measured in one or more small scale A/B experiments using an intentionally degraded model.
5.6.3. Monitor for performance regressions
The model has not experienced a dramatic or slow-leak regression in training speed, serving latency, throughput, or RAM usage: The computational performance (as opposed to predictive quality) of an ML system is often a key concern at scale. Deep neural networks can be slow to train and run inference on; wide linear models with feature crosses can use a lot of memory; any ML model may take days to train; and so forth. Swiftly reacting to changes in this performance due to changes in data, features, modeling, or the underlying compute library or infrastructure is crucial to maintaining a performant system.
How? While measuring computational performance is a standard part of any monitoring, it is useful to slice performance metrics not just by the versions and components of code, but also by data and model versions. Degradations in computational performance may occur as dramatic changes (for which comparison to the performance of prior versions or time slices can be helpful for detection) or as slow leaks (for which a pre-set alerting threshold can be helpful for detection).
5.6.4. Monitor for quality regressions
• Online measurement of accuracy: Just as you need to know the latency of your website and public application programming interfaces, you need to know how accurate your models are in production. How many predictions actually came true? This requires collecting and logging real-use results but is an elementary requirement.
The model has not experienced a regression in prediction quality on served data: Validation data will always be older than real serving input data, so measuring a model’s quality on that validation data before pushing it to serving is only an estimate of quality metrics on actual live serving inputs. However, it is not always possible to know the correct labels even shortly after serving time, making quality measurement difficult.
How? Here are some options for making sure that there is no degradation in served prediction quality due to changes in data, differing code paths, etc.:
• Measure statistical bias in predictions, i.e. the average of predictions in a particular slice of data. Generally speaking, models should have zero bias, in aggregate and on slices (e.g. 90% of the predictions made with probability 0.9 should in fact be positive). Knowing that a model is unbiased is not enough to know it is any good, but knowing there is bias can be a useful canary for detecting problems.
• In some tasks, the label actually is available immediately or soon after the prediction is made (e.g. will a user click on an ad). In this case, we can judge the quality of predictions in almost real time and identify problems quickly.
• Finally, it can be useful to periodically add new training data by having human raters manually annotate labels for logged serving inputs. Some of this data can be held out to validate the served predictions.
However the measurement is done, thresholds must be set for acceptable quality (e.g. based on the quality bounds at launch of the initial system), and a responder should be notified immediately if quality drifts outside those thresholds. As with computational performance, it is crucial to monitor both dramatic and slow-leak regressions in prediction quality.
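The bias check in particular is easy to automate; the sketch below assumes a pandas DataFrame of logged predictions with eventual labels, and the column names and alert threshold are illustrative:

```python
import pandas as pd

def prediction_bias(df: pd.DataFrame, slice_col: str) -> pd.DataFrame:
    """Average predicted probability minus observed positive rate, per slice."""
    report = df.groupby(slice_col).agg(mean_prediction=("prediction", "mean"),
                                       positive_rate=("label", "mean"))
    report["bias"] = report["mean_prediction"] - report["positive_rate"]
    return report

def check_bias_and_alert(df, slice_col, threshold=0.05, notify=print):
    report = prediction_bias(df, slice_col)
    for slice_value, row in report.iterrows():
        if abs(row["bias"]) > threshold:
            notify(f"bias alert: {slice_col}={slice_value} is off by {row['bias']:+.3f}")
    return report
```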
5.6.5. Monitor for data drift
• Mind the gap: That is, watch out for gaps between the distributions of your training and online data sets. This is a simple-to-measure, effective-in-practice heuristic that uncovers a variety of issues. If your training data has 50% high-risk patients, but in production, you're predicting only 30% as high-risk, it's probably time to retrain.
• Online data quality alerts: If the volume or composition of the input data changes in an unexpected way, an alert should go to your operations team. Are your patients suddenly older, more often female, or less often diabetic? If you haven't trained your model on those types of patients, you may be serving bad predictions (see the sketch after this list).
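A minimal sketch of both checks, comparing the high-risk rate and basic input statistics between the training set and a recent serving window; the column names and thresholds are illustrative assumptions:

```python
import pandas as pd

def drift_report(train: pd.DataFrame, recent: pd.DataFrame,
                 rate_col="high_risk", numeric_cols=("age",), max_rate_gap=0.10):
    alerts = []
    rate_gap = abs(train[rate_col].mean() - recent[rate_col].mean())
    if rate_gap > max_rate_gap:
        alerts.append(f"{rate_col} rate shifted by {rate_gap:.2f}; consider retraining")
    for col in numeric_cols:
        shift = abs(train[col].mean() - recent[col].mean()) / (train[col].std() + 1e-9)
        if shift > 0.5:    # more than half a standard deviation of movement
            alerts.append(f"{col} mean moved {shift:.2f} std devs; alert the operations team")
    return alerts
```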
5.6.6. Monitoring and maintenance plan
Summarize monitoring and maintenance strategy including necessary steps and how to perform them.
References
Why ML Models Rarely Reach Production and What You Can Do About it
Why Machine Learning Models Crash And Burn In Production
Full Stack Data Science
Why your Machine Learning model may not work in production?
These Anti-Patterns are Slowing AI Adoption in Enterprises in 2020
Why AI is Challenging in Healthcare
Top 10 Reasons Why AI Projects Fail
Why AI investments fail to deliver
The One Practice That Is Separating The AI Successes From The Failures
Gartner: 85% of AI implementations will fail by 2022
Gartner Predicts Half of Finance AI Projects Will Be Delayed or Cancelled by 2024
Why AI Projects Fail
Top Reasons Why AI Projects Fail
How Data-Literate Is Your Company?
4 Reasons for Artificial Intelligence (AI) Project Failure
Gartner Says Nearly Half of CIOs Are Planning to Deploy Artificial Intelligence
Five Common AI/ML Project Mistakes
Overcoming the C-Suite’s Distrust of AI
Want your company’s A.I. project to succeed? Don’t hand it to the data scientists, says this CEO
The single most important reason why AI projects fail
The Top 5 Reasons Why Most AI Projects Fail
Why 85% of AI projects fail
Beyond the hype: A guide to understanding and successfully implementing artificial intelligence within your business
Twitter taught Microsoft’s AI chatbot to be a racist asshole in less than a day
Google 'fixed' its racist algorithm by removing gorillas from its image-labeling tech
Google Photos Tags Two African Americans As Gorillas Through Facial Recognition Software
Why AI & Machine Learning Projects Fail?
Why Machine Learning Projects Fail Part 2
Why AI investments fail to deliver
Why Nearly 90% of Machine Learning Projects Fail
4 Reasons Why Your Machine Learning Project Could Fail
Here is the list of top10 reasons why large-scale machine learning projects fail
Common Reasons Why Machine Learning Projects Fail
Why Machine Learning Projects Fail
Top 5 Reasons Why Machine Learning Projects Fail
Why Machine Learning Projects Fail and How to Make Sure They Don't
Our Top Data and Analytics Predicts for 2019
White Paper: What Data Scientists Tell Us About AI Model Training Today
IDC Survey Finds Artificial Intelligence to be a Priority for Organizations But Few Have Implemented an Enterprise-Wide Strategy
Requirements Engineering for Machine Learning: A Review and Reflection
The CRISP-DM user guide
A Catalogue of Concerns for Specifying Machine Learning-Enabled Systems
The ML test score: A rubric for ML production readiness and technical debt reduction
How Data-Literate Is Your Company?
MHRA/FDA Principles of Good Machine Learning Practice