Big Data Training: Understanding the Big Data Life Cycle

The volume of data that organizations have to deal with has exploded to unprecedented levels in the past decade, and at the same time the price of data storage has fallen steadily. Private companies and research institutions capture terabytes of data about their users' interactions, businesses, social media, and sensors in devices such as mobile phones and automobiles. The challenge of this era is making sense of this sea of data, and this is where big data analytics comes into the picture. Big data training involves collecting data from different sources, managing it so that it can be consumed by analysts, and finally delivering data products that are useful to the organization's business.

Big Data Life Cycle

In the modern big data context, traditional approaches such as the SEMMA methodology are incomplete and suboptimal. For example, SEMMA completely disregards the collection and preprocessing of data from different sources, even though these stages constitute most of the work in a successful big data project.

The big data training life cycle can be described by the following stages:

  • Business Problem Definition

This stage is common to both traditional BI and big data analytics life cycles. Defining the problem and correctly evaluating how much potential gain it may bring to the organization is normally a non-trivial part of a big data project. It may seem obvious to mention, but the expected gains and costs of the project have to be evaluated.

  • Research

Analyze what other companies have done in the same situation. This involves looking for solutions that are reasonable for your company, even if that means adapting other solutions to the resources and requirements your company has. In this stage, a methodology for the future stages should be defined.

  • Human Resources Assessment

Once the problem is defined, it is reasonable to analyze whether the current staff is able to complete the project successfully. Traditional BI teams might not be capable of delivering an optimal solution for all the stages, so it should be considered before starting the project whether part of it needs to be outsourced or more people need to be hired.

  • Data Acquisition

This stage is key in a big data life cycle; it defines which types of profiles are needed to deliver the resulting data product. Data gathering is a non-trivial step of the process; it normally involves collecting unstructured data from different sources. To give an example, it could involve writing a crawler to retrieve reviews from a website. This means dealing with text, perhaps in different languages, and normally requires a significant amount of time to complete; a minimal crawler sketch follows below.
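
The following sketch shows what such a crawler might look like, assuming a hypothetical page at https://example.com/reviews that renders one HTML element per review; the URL and the CSS selector are illustrative, not taken from any real site.

    import requests
    from bs4 import BeautifulSoup

    def fetch_reviews(url):
        """Download a page and extract the text of each review block."""
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # The "review" class name is an assumption; inspect the real markup.
        return [div.get_text(strip=True) for div in soup.select("div.review")]

    for review in fetch_reviews("https://example.com/reviews"):
        print(review)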

  • Data Munging

Once the data is retrieved, for example from the web, it needs to be stored in an easy-to-use format. Suppose, for instance, that one source provides reviews as one-to-five star ratings while another provides them directly as positive or negative labels. To combine both data sources, a decision has to be made to make these two response representations equivalent. This can involve converting the first representation to the second form, considering one star as negative and five stars as positive. This process often requires a large time allocation to be delivered with good quality; a small conversion sketch follows below.
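
A minimal sketch of the conversion just described, assuming the star-rated source and the already-labeled source are simple in-memory lists; the one-star and five-star mappings come from the text above, while the two-star cutoff for the values in between is an assumption.

    def stars_to_label(stars):
        """Map a 1-5 star rating onto the positive/negative representation."""
        if not 1 <= stars <= 5:
            raise ValueError("unexpected rating: %r" % stars)
        # One star is negative, five stars positive; cutting off "negative"
        # at two stars for the in-between values is an assumption.
        return "negative" if stars <= 2 else "positive"

    source_a = [("Great product", 5), ("Broke in a week", 1)]  # star ratings
    source_b = [("Works as advertised", "positive")]           # already labeled
    unified = [(text, stars_to_label(s)) for text, s in source_a] + source_b
    print(unified)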

  • Data Storage

Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives on this point. The most common one is storing the data in the Hadoop Distributed File System (HDFS) and querying it through Apache Hive, which provides a limited version of SQL known as the Hive Query Language (HiveQL). From the user's perspective, this allows most analytics tasks to be done in ways similar to traditional BI data warehouses; a short query example follows below. Other storage options to consider are MongoDB and Redis; Apache Spark is often mentioned alongside them, although it is a processing engine rather than a storage system.
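
To illustrate how HiveQL feels like ordinary SQL, the sketch below queries a hypothetical reviews table through the PyHive client; the host, database, table, and column names are all assumptions.

    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()
    # HiveQL reads like standard SQL for most analytics tasks.
    cursor.execute("SELECT label, COUNT(*) AS n FROM reviews GROUP BY label")
    for label, n in cursor.fetchall():
        print(label, n)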

  • Exploratory Data Analysis

Once the data has been cleaned and stored in a way that insights can be retrieved from it, the data exploration phase is mandatory. The objective of this stage is to understand the data; this is normally done with statistical techniques and by plotting the data. It is also a good stage to evaluate whether the problem definition makes sense and is feasible; a brief exploration sketch follows below.
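
A minimal exploration sketch with pandas and matplotlib, assuming the cleaned reviews were written to a hypothetical reviews.csv file with text and label columns.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("reviews.csv")
    print(df.describe(include="all"))   # summary statistics
    print(df["label"].value_counts())   # class balance of the response

    df["text"].str.len().hist(bins=30)  # distribution of review lengths
    plt.xlabel("review length (characters)")
    plt.ylabel("count")
    plt.show()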

  • Data Preparation for Modeling and Assessment

This stage involves reshaping the cleaned data retrieved previously and applying statistical preprocessing such as missing-value imputation, outlier detection, normalization, feature extraction, and feature selection; a small preprocessing sketch follows below.
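
A minimal preprocessing sketch with scikit-learn covering three of the steps named above (imputation, normalization, and feature selection); the feature matrix and response here are random placeholder data.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    preprocess = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # missing-value imputation
        ("scale", StandardScaler()),                   # normalization
        ("select", SelectKBest(f_classif, k=10)),      # feature selection
    ])

    rng = np.random.default_rng(0)
    X = rng.random((100, 20))      # placeholder feature matrix
    y = rng.integers(0, 2, 100)    # placeholder binary response
    print(preprocess.fit_transform(X, y).shape)  # -> (100, 10)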

  • Modeling

The prior stage should have produced several datasets for training and testing, for example for a predictive model. This stage involves trying different models with the aim of solving the business problem at hand. In practice, it is normally desirable that the model also gives some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a left-out dataset, as in the sketch below.
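
A minimal model-selection sketch with scikit-learn: two candidate models are compared on a held-out test set, and a synthetic dataset stands in for the real problem.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(random_state=0),
    }
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        # Evaluate each candidate on the left-out data, as described above.
        print(name, accuracy_score(y_test, model.predict(X_test)))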

  • Implementation

In this stage, the data product that was developed is integrated into the company's data pipeline. This involves setting up a validation scheme while the data product is in operation, in order to track its performance. For example, when a predictive model is implemented, this stage involves applying the model to new data and, once the true response is available, evaluating the model against it, as sketched below.
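
A minimal monitoring sketch for a deployed model, assuming the trained model was persisted with joblib; the file name and the validation metric are hypothetical choices.

    import joblib
    from sklearn.metrics import accuracy_score

    model = joblib.load("model.joblib")  # hypothetical persisted model

    def score_batch(X_new):
        """Apply the deployed model to a batch of new incoming data."""
        return model.predict(X_new)

    def validate(predictions, y_true):
        """Once the true response is available, track live performance."""
        acc = accuracy_score(y_true, predictions)
        print("live accuracy: %.3f" % acc)  # in production this would be logged
        return acc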


Vidhi Yadav
