Data Analytics Lifecycle Phase 2: Data Preparation

Data Preparation

Of all the phases, Data Preparation is generally the most iterative and time-intensive.

In this step, you will need to define a workspace, often called an analytic sandbox, where you can explore the data without interfering with live production databases.

For instance, you may need to work with a company’s financial data, but you cannot interact with the production version of the organization’s main database, since it will be tightly controlled and needed for financial reporting.

You should be collecting all kinds of data in your sandbox, as you will need access to high volumes and varieties of data for a Big Data analytics project. This can include everything from summary-level structured data, to raw data feeds, to unstructured text data from call logs or web logs, depending on the kind of analysis you plan to undertake. Expect this sandbox to be large: at least 10 times the size of a company’s enterprise data warehouse (EDW).

Make sure you have strong bandwidth and network connections to the underlying data sources so you can quickly perform transformations or extractions on the data sets you will need. Note that the graphic above indicates doing “ELT” rather than ETL, the more typical approach to data extraction. In ETL, users perform Extract, Transform, Load: data is transformed before it is loaded into the database. Using the analytic sandbox approach, we advocate ELT: Extract, Load, then Transform. In this case, the data is extracted in its raw form and loaded into the database, where analysts can choose to transform it into a new state or leave it in its original, raw condition. The reason for this approach is that there is significant value in preserving the raw data in the sandbox before any transformations are applied.
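As a minimal sketch of this pattern, assuming a hypothetical raw transaction feed and a local SQLite file standing in for the sandbox (the file, table, and column names here are illustrative, not from any specific system), ELT in Python might look like this: the raw records are loaded untouched first, and the cleaned table is derived afterward, so the original data is preserved.

```python
import sqlite3
import pandas as pd

# Extract: read the raw feed exactly as delivered (hypothetical file name).
raw = pd.read_csv("card_transactions_raw.csv")

# Load: land the untouched records in the analytic sandbox first.
sandbox = sqlite3.connect("analytic_sandbox.db")
raw.to_sql("transactions_raw", sandbox, if_exists="replace", index=False)

# Transform: derive a cleaned table *after* loading, leaving the raw table intact.
sandbox.execute("""
    CREATE TABLE IF NOT EXISTS transactions_clean AS
    SELECT
        transaction_id,
        UPPER(TRIM(merchant)) AS merchant,
        amount
    FROM transactions_raw
    WHERE amount IS NOT NULL
""")
sandbox.commit()
```

Because the transform step only creates a new table, analysts can always go back to transactions_raw when a question calls for the unaltered data.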

For instance, consider the example of an analysis for fraud detection on credit card usage. Many times, the outliers in this data population can represent higher-risk transactions that may be indicative of fraudulent credit card activity. Using ETL, these outliers may be inadvertently filtered out, or transformed and cleaned, before being loaded into the database. For this reason, ELT is encouraged: you keep the data in its raw state and retain the ability to transform it after loading. This approach gives you clean data to analyze after it is in the database, as well as the data in its original form for finding hidden nuances.
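To make this concrete, here is a small sketch that builds on the hypothetical transactions_raw table from the previous example: rather than dropping extreme values before the load, the analyst flags outliers after loading, so potentially fraudulent transactions remain available for review. The three-standard-deviation threshold is just one simple choice for illustration.

```python
import sqlite3
import pandas as pd

sandbox = sqlite3.connect("analytic_sandbox.db")
txns = pd.read_sql("SELECT transaction_id, amount FROM transactions_raw", sandbox)

# Flag outliers instead of filtering them out; extreme amounts may be the
# very transactions a fraud model needs to see.
threshold = txns["amount"].mean() + 3 * txns["amount"].std()
txns["possible_fraud"] = txns["amount"] > threshold

print(txns[txns["possible_fraud"]].head())
```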
Tools

Hadoop can perform massively parallel ingest and custom analysis for parsing web traffic, GPS location analytics, genomic analysis, and for combining massive unstructured data feeds from multiple sources. Alpine Miner provides a GUI for creating analytic workflows, including data manipulations and a series of analytic events such as staged data-mining techniques (e.g., first select the top 100 customers, then run descriptive statistics and clustering) on PostgreSQL and other Big Data sources.
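To illustrate the idea of a staged workflow, here is a rough Python equivalent of the example above, using pandas and scikit-learn rather than Alpine Miner itself; the customer table and column names are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer-level data already staged in the sandbox.
customers = pd.read_csv("customer_summary.csv")

# Stage 1: select the top 100 customers by total spend.
top100 = customers.nlargest(100, "total_spend")

# Stage 2: run descriptive statistics on the selected group.
print(top100[["total_spend", "num_orders", "tenure_months"]].describe())

# Stage 3: cluster the same group into a handful of segments.
features = top100[["total_spend", "num_orders", "tenure_months"]]
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
top100 = top100.assign(segment=kmeans.fit_predict(features))
print(top100["segment"].value_counts())
```

Each stage feeds the next, which is the same chaining of data manipulations and analytic events that a GUI workflow tool expresses visually.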

People

For Phase 2, you will need assistance from IT, DBAs, or whoever controls the EDW or the data sources you will be using.