Key Characteristics of Big Data


Two characteristics stand out: data volume, with a 44x increase from 2009 to 2020 (0.8 zettabytes to 35.2 zettabytes), and processing complexity.

Big data can come in multiple forms: everything from highly structured financial data to text files, multimedia files, and genetic mappings. High volume is a consistent characteristic of big data. As a corollary, and because of the complexity of the data itself, the preferred approach is to process big data in parallel computing environments such as Massively Parallel Processing (MPP) systems, which enable simultaneous, parallel ingest, data loading, and analysis. As we will see in the next slide, most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze.
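To make the idea of parallel processing concrete, here is a minimal sketch of the split, process, and merge pattern. It is an illustration only, not how an MPP database is implemented: the records are made up, the function names are hypothetical, and threads on one machine stand in for the many nodes a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Analyze one partition independently of all the others."""
    counts = Counter()
    for record in partition:
        counts.update(record.lower().split())
    return counts

def parallel_word_count(records, workers=4):
    """Split records into partitions, process them concurrently, merge the results."""
    # Round-robin split: worker i gets records i, i + workers, i + 2*workers, ...
    partitions = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partials:
        total += partial  # merge step: combine the per-partition counts
    return total

# Made-up sample records, standing in for a much larger data set.
records = ["disk error on node", "network error", "disk full", "login ok"]
totals = parallel_word_count(records, workers=2)
print(totals.most_common(2))
```

The key design point is that each partition is processed without reference to any other, so adding workers (or, in a real system, nodes) lets ingest and analysis scale with the data volume.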

In addition, you will likely have unstructured or semi-structured data, such as free-form call log information taken from an email ticket describing the problem, or from notes on an actual phone call that describe a technical problem and its solution. The most salient information is often hidden there. Another possibility is voice logs or audio transcripts of the actual call that are associated with the structured data. Until recently, most analysts would have been able to analyze only the most common and highly structured data in this call log history RDBMS, since mining the textual information is very labor intensive and could not be easily automated.
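As a rough illustration of what automating part of that text mining can look like, the sketch below pulls error codes out of free-form ticket notes and ties them back to a structured field. The tickets, the field names, and the "error E42" code format are all invented for the example; real call logs would need far more robust handling.

```python
import re
from collections import Counter

# Hypothetical call log rows: structured fields plus a free-form notes field.
tickets = [
    {"ticket_id": 1, "product": "router",
     "notes": "Customer reports error E42 after firmware update. Rebooted, issue resolved."},
    {"ticket_id": 2, "product": "router",
     "notes": "Intermittent drops, error E42 again. Replaced power supply."},
    {"ticket_id": 3, "product": "modem",
     "notes": "No sync light. Error E17 shown. Escalated to tier 2."},
]

# Assumed convention: codes appear in the notes as "error E<digits>".
ERROR_CODE = re.compile(r"\berror\s+(E\d+)\b", re.IGNORECASE)

def error_codes_by_product(rows):
    """Count (product, error code) pairs mentioned in the free-form notes."""
    counts = Counter()
    for row in rows:
        for code in ERROR_CODE.findall(row["notes"]):
            counts[(row["product"], code.upper())] += 1
    return counts

codes = error_codes_by_product(tickets)
for (product, code), n in sorted(codes.items()):
    print(product, code, n)
```

Once the codes are extracted, they can be analyzed alongside the structured columns, which is the information that was previously locked in the text.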

Here are examples of what each of the four main types of data structure may look like. People tend to be most familiar with analyzing structured data, while semi-structured data (shown as XML here), quasi-structured data (shown as a clickstream string), and unstructured data present different challenges and require different techniques to analyze. For each data type shown, answer these questions:
1) What types of analytics are performed on this data?
2) Who analyzes this kind of data?
3) What types of data repositories are suited to each, and what requirements might you have for storing and cataloguing this kind of data?
4) Who consumes the data?
5) Who manages and owns the data?
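
To see why the four data types call for different techniques, the sketch below reads one small sample of each using only the Python standard library. All sample values (the table, the XML record, the URL, and the sentence) are invented for illustration and are not the examples from the slide.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Structured: fixed columns and types, as in an RDBMS table or CSV export.
structured = "year,region,revenue\n2011,EMEA,120\n2011,APAC,95\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing tags, but no rigid schema (XML here).
semi = "<call><id>17</id><agent>lee</agent><note>reset password</note></call>"
call_fields = {child.tag: child.text for child in ET.fromstring(semi)}

# Quasi-structured: a clickstream URL has a recognizable but loose format.
click = "http://www.example.com/search?q=big+data&page=2"
parsed = urlparse(click)
query = parse_qs(parsed.query)

# Unstructured: free text has no inherent fields, so we start by tokenizing.
text = "Customer called twice about the same outage and asked for a refund."
tokens = re.findall(r"[a-z]+", text.lower())

print(rows[0]["revenue"], call_fields["agent"], query["q"], tokens[:3])
```

Note how the amount of work needed before any analysis can begin grows as the structure weakens: the structured rows are ready to query at once, while the free text yields only a list of words.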