Phase 3 represents the last step of preparation before executing the analytical models and, as such, requires you to be thorough in planning the analytical work and experiments for the next phase. This is the time to refer back to the hypotheses you developed in Phase 1, when you first became acquainted with the data and built your understanding of the business problems or domain area. These hypotheses will help you frame the analytics you'll execute in Phase 4 and choose the right methods to achieve your objectives. Some of the conditions to consider include:
– Structure of the data. The structure of the data is one factor that will dictate the tools and analytical techniques you can use in the next phase. Analyzing textual data will require different tools and approaches (e.g., sentiment analysis using Hadoop) than forecasting market demand based on structured financial data (e.g., revenue projections and market sizing using regressions).
– Ensure that the analytical techniques will enable you to meet the business objectives and prove or disprove your working hypotheses.
– Determine whether your situation warrants a single test (e.g., binomial logistic regression or market basket analysis) or a series of techniques as part of a larger analytic workflow. A tool such as Alpine Miner will enable you to set up a series of steps and analyses (e.g., select the top 100,000 customers ranked by account value, then predict the likelihood of churn based on another set of heuristics) and can serve as a front-end UI for manipulating big data sources in PostgreSQL.
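The two-step workflow mentioned above (rank customers by account value, then score the top segment for churn) can be sketched in a few lines of Python. The customer records, the ranking cutoff, and the churn heuristic below are all hypothetical stand-ins for illustration, not a real model:

```python
# Hypothetical two-step analytic workflow: rank customers by account
# value, keep the top segment, then score churn risk with a simple
# hand-set heuristic. A real workflow would pull from a database and
# apply a fitted model rather than invented weights.
customers = [
    {"id": 1, "account_value": 50000,  "support_calls": 7, "tenure_years": 1},
    {"id": 2, "account_value": 120000, "support_calls": 1, "tenure_years": 6},
    {"id": 3, "account_value": 80000,  "support_calls": 4, "tenure_years": 2},
    {"id": 4, "account_value": 15000,  "support_calls": 0, "tenure_years": 9},
]

# Step 1: select the top customers ranked by account value
# (top 100,000 in the text; top 3 here for illustration).
top = sorted(customers, key=lambda c: c["account_value"], reverse=True)[:3]

# Step 2: score churn likelihood with a hypothetical heuristic --
# more support calls and shorter tenure push the score up.
def churn_score(c):
    return min(1.0, 0.1 * c["support_calls"] + 0.5 / c["tenure_years"])

scored = [(c["id"], round(churn_score(c), 2)) for c in top]
print(scored)  # [(2, 0.18), (3, 0.65), (1, 1.0)]
```

Chaining the selection step into the scoring step is exactly the kind of multi-stage workflow a front-end tool would let you assemble visually.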
In addition to the above, consider how people generally solve this kind of problem and look to address this type of question. With the kind of data and resources you have available, consider if similar approaches will work or if you will need to create something new. Many times you can get ideas from analogous problems people have solved in different industry verticals.
There are many tools available to you; here are a few:
– R has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code. In addition, it can interface with PostgreSQL and execute statistical tests and analyses against big data via an open source connector (RPostgreSQL). These two factors make R well suited to performing statistical tests and analytics on big data. R contains over 3,000 packages for analysis and graphical representation. New packages are being posted all the time, and many companies are providing value-add services for R (training, instruction, best practices), as well as packaging it in ways that make it easier to use and more robust. This phenomenon is similar to what happened with Linux in the 1990s, when companies appeared to package Linux and make it easier for companies to consume and deploy. Use R with file extracts for offline analysis and optimal performance; use RPostgreSQL for dynamic queries and faster development.
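The trade-off described above (file extracts for offline work versus dynamic in-database queries) applies regardless of language. As a minimal sketch, the standard-library `sqlite3` module stands in for a PostgreSQL connection; the `sales` table and its columns are hypothetical:

```python
import sqlite3

# An in-memory SQLite database stands in for PostgreSQL here; the
# point is the same: push the aggregation into the database so only
# a small result set crosses the connection, instead of extracting
# every raw row to a file for offline analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 250.0), ("west", 300.0)])

# Dynamic query: the SUM/GROUP BY runs inside the database engine.
rows = conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 350.0), ('west', 300.0)]
```

With a real big data source, the same query shape would be issued through the database connector from R or another client.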
– SQL Analysis Services can perform in-database analytics of common data mining functions, involving aggregations and basic predictive models.
– SAS/ACCESS provides integration between SAS and external databases via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS you can connect to relational databases (such as Oracle or Teradata), data warehouse appliances (such as Greenplum or Aster), files, and enterprise applications (e.g., SAP, Salesforce, etc.).
• There is some exploration in the data prep phase, mostly for data hygiene reasons and to assess the quality of the data itself. In this phase, it is important to explore the data to understand the relationships among the variables, to inform the selection of variables and methods, and to understand the problem domain. Data visualization tools can help with this, letting you preview the data and assess the relationships between variables.
• In many cases, stakeholders and subject matter experts will have gut feelings about what you should be considering and analyzing. Likely, they had some hypothesis that led to the genesis of the project. Many times stakeholders have a good grasp of the problem and domain, though they may not be aware of the subtleties within the data or of the model needed to prove or disprove a hypothesis. Other times, stakeholders may be correct, but for an unintended reason (correlation does not imply causation). Data scientists have to come in unbiased and be ready to question all assumptions.
• Consider the inputs/data that you think you will need, then examine whether these inputs are actually correlated with the outcomes you are trying to predict or analyze. Some methods and types of models will handle this well; others will not handle it as well. Depending on what you are solving for, you may need to consider a different method, winnow the inputs, or transform the inputs so that you can use the best method.
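A quick way to examine input-outcome correlation is to compute a Pearson coefficient per candidate input and winnow out the weak ones. The sketch below uses only the standard library; the input names, data values, and the 0.5 cutoff are hypothetical:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical candidate inputs and the outcome to predict.
outcome = [10, 12, 15, 19, 24]
inputs = {
    "marketing_spend": [1, 2, 3, 4, 5],   # strongly related to outcome
    "office_floor":    [3, 1, 4, 1, 5],   # essentially noise
}

# Winnow: keep only inputs with a meaningful correlation to the
# outcome (the 0.5 threshold is a judgment call, not a rule).
kept = [name for name, xs in inputs.items()
        if abs(pearson(xs, outcome)) >= 0.5]
print(kept)  # ['marketing_spend']
```

Simple screens like this do not replace model-based variable selection, but they cheaply flag inputs unlikely to earn their place.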
• Aim to capture the most essential predictors and variables rather than considering every possible variable that you think may influence the outcome. This will require iteration and testing to identify the most essential variables for the analyses you select. Test a range of variables to include in your model, then focus on the most important and influential ones. This winnowing process is a form of dimensionality reduction.
Variable Selection (cont.)
• If running regressions, identify the candidate predictors and outcomes of the model. Look to create variables that have a strong relationship to the outcome rather than to each other. Be vigilant for problems such as serial correlation, collinearity, and other typical data modeling problems that will interfere with the validity of these models. Consider this within the context of how you framed the problem. Sometimes correlation is all you need (“black box prediction”); in other cases you will want the causal relationship, such as when you want the model to have explanatory power, or when you want to forecast or stress test under situations a bit outside your range of observations (always dangerous).
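A basic collinearity screen is to compute pairwise correlations among the candidate predictors and flag pairs above a threshold. The predictor names, values, and the 0.8 cutoff below are hypothetical, a sketch of the idea rather than a substitute for formal diagnostics such as variance inflation factors:

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical candidate predictors; "spending" tracks "income"
# almost exactly, so that pair should be flagged as collinear.
predictors = {
    "income":   [30, 45, 60, 75, 90],
    "spending": [28, 44, 58, 76, 88],
    "age":      [25, 52, 31, 47, 60],
}

# Flag predictor pairs whose absolute correlation exceeds 0.8
# (the cutoff is a judgment call).
collinear = [(a, b) for a, b in combinations(predictors, 2)
             if abs(pearson(predictors[a], predictors[b])) > 0.8]
print(collinear)  # [('income', 'spending')]
```

Flagged pairs are candidates for dropping one variable, combining the two, or transforming them before fitting the regression.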
• Converting a model created in R or a native statistical package to SQL will enable you to run the operation in-database, which will provide optimal performance during runtime.
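One simple form of this conversion is to export the fitted coefficients and emit an equivalent SQL scoring expression. In the sketch below, the table name, column names, and coefficient values are hypothetical stand-ins for a model fitted elsewhere (e.g., in R):

```python
# Hypothetical coefficients from a linear model fitted elsewhere.
# Emitting them as SQL lets the scoring run in-database instead of
# pulling every row out to the client.
intercept = 1.5
coefficients = {"tenure_years": -0.3, "support_calls": 0.8}

# Build one "(coef * column)" term per predictor.
terms = [f"({coef} * {col})" for col, coef in coefficients.items()]
score_sql = (f"SELECT customer_id, {intercept} + " + " + ".join(terms)
             + " AS churn_score FROM customers")
print(score_sql)
```

The generated statement can then be run directly by the database, which is what makes the approach scale to large tables.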
• Consider the major data mining and predictive analytical techniques, such as categorization, association rules, and regressions.
• Determine whether you will be using techniques that are best suited for structured data, unstructured data, or a hybrid approach. You can leverage MapReduce to analyze unstructured data.
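To make the MapReduce idea concrete, here is a minimal single-process sketch of the pattern applied to unstructured text (the classic word count). The two example documents are invented; a real cluster such as Hadoop distributes the same map and reduce steps across many machines:

```python
from collections import Counter
from itertools import chain

# Hypothetical unstructured input: a small corpus of text documents.
documents = ["the model predicts churn", "the churn model"]

# Map phase: each document independently emits (word, 1) pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map_phase(d) for d in documents)

# Reduce phase: group by key (the word) and sum the counts.
counts = Counter()
for word, n in mapped:
    counts[word] += n
print(counts["churn"])  # 2
```

Because the map step needs no shared state and the reduce step only sums per key, both parallelize naturally, which is why the pattern suits large volumes of unstructured data.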
You can move to the next phase when… you have a good idea about the type of model to try and you can refine the analytic plan. This includes a general methodology, a solid understanding of the variables and techniques to use, and a description or diagram of the analytic workflow.