Data Analytics Lifecycle Phase 4: Model Building


In this phase, the model is fit on the training data and evaluated (scored) against the test data. Generally this work takes place in the sandbox, not in the live production environment. The Model Planning and Model Building phases overlap quite a bit, and in practice you may iterate back and forth between the two for a while before settling on a final model. Whether a training data set is required depends on whether the method is a supervised or an unsupervised machine learning algorithm. Although the modeling techniques and logic used in this step can be highly complex, the phase itself can be quite short compared with all of the work required to prepare the data and define the approaches. In general, plan to spend more time preparing and learning the data (Phases 1-2) and crafting a presentation of the findings (Phase 5); Phases 3 and 4 tend to move more quickly, even though they are more complex from a conceptual standpoint.
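The fit-then-score pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the model is a simple least-squares line, and the helper names (train_test_split, fit_line, rmse) are hypothetical and written for this example rather than taken from any particular library.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (training, test) sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, rows):
    """Root mean squared error of the fitted line on the given rows."""
    intercept, slope = model
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in rows]
    return (sum(squared) / len(rows)) ** 0.5


# Synthetic data (an assumption for illustration): y = 3 + 2x plus noise.
noise = random.Random(42)
data = [(float(x), 3.0 + 2.0 * x + noise.gauss(0, 1.0)) for x in range(40)]

train, test = train_test_split(data)
model = fit_line(train)        # fit on the training data only
train_rmse = rmse(model, train)
test_rmse = rmse(model, test)  # score against data the model has not seen

print("intercept, slope:", model)
print("train RMSE:", train_rmse, "test RMSE:", test_rmse)
```

A test error that is much larger than the training error is one sign that the model may not generalize, which is a reason to iterate back to Model Planning.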

As part of this phase, you’ll need to conduct these steps:

1) Execute the models defined in Phase 3

2) Where possible, convert the models to SQL or a similar, appropriate database language and execute them as in-database functions, since on large data sets this is typically significantly faster and more efficient than running in memory. For example, execute R models on large data sets as PL/R or SQL (PL/R is a PostgreSQL language extension that allows you to write PostgreSQL functions and aggregate functions in R). SAS Scoring Accelerator enables you to run SAS models in the database if they were created using SAS Enterprise Miner.

3) Use R (or SAS) models on file extracts for testing and small data sets

4) Assess the validity of the model and its results (for instance, does it account for most of the data, and does it have robust predictive power?)

5) Fine-tune the models to optimize the results (for example, modify variable inputs)

6) Record the results and the logic of the model.
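To make step 2 more concrete, here is a minimal sketch of scoring a model inside the database rather than in memory. It uses Python's built-in sqlite3 module as a stand-in for a production database; it is not PL/R or SAS Scoring Accelerator. The coefficients, the customers table, and the linear_model_to_sql helper are all hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical coefficients of a linear model that was fitted elsewhere
# (for example, in R on a file extract).
coefficients = {"intercept": 1.5, "age": 0.02, "visits": 0.3}


def linear_model_to_sql(coefs):
    """Render a linear model as a SQL arithmetic expression.

    Column names are interpolated directly into the SQL text, so they must
    come from a trusted source (here, the hard-coded dictionary above).
    """
    terms = [repr(coefs["intercept"])]
    terms += [f"{w!r} * {name}" for name, w in coefs.items() if name != "intercept"]
    return " + ".join(terms)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, age REAL, visits REAL)")
rows = [(1, 30.0, 2.0), (2, 50.0, 10.0)]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# In-database scoring: the database evaluates the expression for every row.
score_sql = (
    f"SELECT id, {linear_model_to_sql(coefficients)} AS score "
    "FROM customers ORDER BY id"
)
in_db_scores = conn.execute(score_sql).fetchall()

# In-memory scoring of the same rows, used here only as a cross-check.
in_memory_scores = [
    (i, coefficients["intercept"]
        + coefficients["age"] * age
        + coefficients["visits"] * visits)
    for i, age, visits in rows
]

print(score_sql)
print(in_db_scores)
conn.close()
```

Comparing the in-database scores with the in-memory scores on a small extract, as in step 3, is a cheap way to confirm that the translation to SQL preserved the model's logic.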


While iterating on and refining the model, consider the following:

• Does the model look valid and accurate on the test data?

• Does the model output/behavior make sense to the domain experts? That is, does it look like the model is giving “the right answers”, or answers that make sense in this context?

• Is the model accurate enough to meet the goal?

• Is it avoiding the kind of mistakes it needs to avoid? Depending on context, false positives may be more serious or less serious than false negatives, for instance. (False positives and negatives will be discussed further in Module 3)

• Do the parameter values of the fitted model make sense in the context of the domain?

• Do you need more data or more inputs? Do you need to transform or eliminate any of the inputs?

• Do you need a different form of model? If so, you’ll need to go back to the Model Planning phase and revise your modeling approach.
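The questions about accuracy and about which mistakes matter most can be checked with a simple tally of prediction outcomes. The sketch below is illustrative only: the labels, the predictions, and the per-error costs are made-up values, and in practice the costs would come from the domain experts.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["tp"] += 1
        elif p == 1 and a == 0:
            counts["fp"] += 1  # false positive: predicted yes, actually no
        elif p == 0 and a == 1:
            counts["fn"] += 1  # false negative: predicted no, actually yes
        else:
            counts["tn"] += 1
    return counts


# Made-up test-set labels and model predictions, for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

c = confusion_counts(actual, predicted)
accuracy = (c["tp"] + c["tn"]) / len(actual)

# Assumed costs: in this hypothetical context a false negative is five
# times as serious as a false positive.
COST_FALSE_POSITIVE = 1.0
COST_FALSE_NEGATIVE = 5.0
total_cost = c["fp"] * COST_FALSE_POSITIVE + c["fn"] * COST_FALSE_NEGATIVE

print(c)
print("accuracy:", accuracy, "total cost:", total_cost)
```

Two models with the same accuracy can have very different total costs, which is why the mix of false positives and false negatives should be reviewed alongside overall accuracy.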