Data Preprocessing

What Is Data Preprocessing?

Data preprocessing is the task of cleaning and transforming raw data to make it suitable for analysis and modeling. Raw data often includes missing values, outliers, and other inconsistencies, such as formatting issues. Preprocessing steps include data cleaning, data transformation (such as normalization), and structural operations. The goal of data preprocessing is to improve both the accuracy and efficiency of downstream analysis and modeling.

MATLAB® provides apps and functions to preprocess input data to make it suitable for statistical modeling, machine learning algorithms, and other data-driven applications.

Figure 1 shows raw data that includes missing values and outliers, which can lead to erroneous conclusions during analysis. Figure 2 shows the same data after applying three data preprocessing techniques: filling missing data, removing outliers, and smoothing. With the data quality improved, attributes such as the magnitude, frequency, and nature of the periodicity become clearly visible.

Figure 1. MATLAB plot of raw data containing missing values and outliers.

Figure 2. Original and preprocessed data after applying the smoothdata function in MATLAB.

Data Preprocessing Techniques

Data preprocessing techniques can be grouped into three main categories: data cleaning, data transformation, and structural operations. These steps can be applied in any order and repeated iteratively.

Data Cleaning

Data cleaning is the process of addressing anomalies in the data set using techniques such as:

  • Managing outliers: Identifying outliers and then either removing them or replacing them with statistically estimated values.
  • Filling missing data: Identifying missing or invalid data points and replacing them with interpolated values.
  • Smoothing: Filtering out noise using techniques such as moving mean, linear regression, and more specialized filtering methods.

Figure 3. Time-series plot of a solar irradiance raw data set, including missing values.

Figure 4. Solar irradiance data preprocessed with the fillmissing function in MATLAB to fill in missing values.
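
The three cleaning techniques listed above can be sketched on a small vector. This is a minimal illustration, not the data shown in Figures 3 and 4: the sample values are hypothetical, and the choices of linear interpolation, the default outlier detection rule, and a 3-point moving mean are assumptions made for the example.

```matlab
% Hypothetical sample: a ramp with one missing value and one outlier
x = [1 2 3 NaN 5 6 70 8 9 10];

% Filling missing data: interpolate linearly between neighboring values
xFilled = fillmissing(x, 'linear');

% Managing outliers: by default, isoutlier flags values more than
% three scaled median absolute deviations (MAD) from the median
isOut = isoutlier(xFilled);

% Replace the flagged values by linear interpolation of their neighbors
xClean = filloutliers(xFilled, 'linear');

% Smoothing: 3-point moving mean (the window shrinks at the endpoints)
xSmooth = smoothdata(xClean, 'movmean', 3);
```

With this sample, only the value 70 is flagged as an outlier, and the cleaned vector is the ramp 1 through 10. Other fill methods (for example 'previous' or 'nearest') and detection methods (for example 'quartiles') may suit other data better.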

Data Transformation

Data transformation is the process of modifying a data set into a preferred format by using operations such as:

  • Normalization and rescaling: Standardizing data sets with different scales into a uniform scale
  • Detrending: Removing polynomial trends to enhance visibility of variations in the data set

Figure 5. Raw data, its trend, and its preprocessed version with trend bias eliminated using the detrend function in MATLAB.
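
As a rough sketch of these two transformations, the following uses a hypothetical signal (a linear trend plus an oscillation, not the data in Figure 5). The rescaling range [0, 1] and the linear trend are assumptions chosen for the example.

```matlab
% Hypothetical signal: linear trend plus an oscillation
t = (0:19)';
y = 0.5*t + sin(2*pi*t/10);

% Normalization and rescaling
yRange = normalize(y, 'range');   % rescale to the interval [0, 1]
yZ     = normalize(y);            % z-score: zero mean, unit standard deviation

% Detrending: remove the best-fit straight line (the default for detrend)
yDetrended = detrend(y);
```

After detrending, the oscillation is centered on zero, which makes its amplitude and period easier to read from a plot.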

Structural Operations

Structural operations are often used for combining, reorganizing, and categorizing data sets and include:

  • Joining: Combining two tables or timetables by rows using a common key variable
  • Stacking and unstacking: Reshaping table variables to consolidate or redistribute data within the table, making it easier to analyze
  • Grouping and binning: Organizing the data set into groups or bins so that each one can be summarized and compared
  • Calculating pivot tables: Breaking down large tabular data sets into sub-tables to gain focused information
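
A minimal sketch of joining, stacking, grouping, and binning follows. The tables, variable names, and bin edges are hypothetical and exist only to illustrate the calls. Pivot tables are left out of the sketch.

```matlab
% Hypothetical tables that share the key variable Site
sites = table({'A'; 'B'}, [35.1; 40.7], ...
    'VariableNames', {'Site', 'Latitude'});
readings = table({'A'; 'A'; 'B'; 'B'}, [10; 20; 30; 50], [1; 2; 3; 4], ...
    'VariableNames', {'Site', 'Morning', 'Evening'});

% Joining: attach the latitude of each site to its readings
joined = innerjoin(readings, sites, 'Keys', 'Site');

% Stacking: merge Morning and Evening into one data variable,
% with an indicator variable recording where each value came from
stacked = stack(readings, {'Morning', 'Evening'}, ...
    'NewDataVariableName', 'Irradiance', ...
    'IndexVariableName', 'TimeOfDay');

% Grouping: mean of Morning for each site
stats = groupsummary(readings, 'Site', 'mean', 'Morning');

% Binning: assign each Morning value to the bin [0,25) or [25,100]
bins = discretize(readings.Morning, [0 25 100]);
```

Here stats.mean_Morning is 15 for site A and 40 for site B, and unstack reverses the stacking step.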

Data Preprocessing and Data Types

Data preprocessing steps vary with the type of data. Here are three examples of preprocessing methods available for different data types.

  • Time-series data: You can perform a variety of data cleaning and preprocessing tasks, such as removing missing values, filtering, smoothing, and synchronizing timestamped data with different time steps. See: Preprocess and Explore Time-Stamped Data
  • Tabular data: When a table has messy data, you can clean the table by filling in or removing missing values and rearranging table rows and variables in a different order. See: Clean Messy and Missing Data in Tables
  • Image data: Data preprocessing is useful for applications involving images, including AI. You can preprocess your data by resizing or cropping the images, or even by increasing the amount of training data for deep learning models. See: Preprocessing Images for Deep Learning
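
For the time-series case, synchronizing timestamped data with different time steps might look like the sketch below. The two timetables, their variable names, and the 30-minute target step are hypothetical. Linear interpolation is one of several possible methods.

```matlab
% Two hypothetical timetables sampled at different rates
t0  = datetime(2024, 1, 1);
ttA = timetable(t0 + hours(0:3)', [1; 2; 3; 4], ...
    'VariableNames', {'Temp'});           % hourly samples
ttB = timetable(t0 + minutes(0:30:180)', (0:6)', ...
    'VariableNames', {'Flow'});           % 30-minute samples

% Put both on a common 30-minute time vector,
% interpolating linearly where a variable has no sample
tt = synchronize(ttA, ttB, 'regular', 'linear', 'TimeStep', minutes(30));
```

The result has one row per 30-minute step, with Temp interpolated between its hourly samples.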

Data Preprocessing with MATLAB

Choosing the right preprocessing approach is not always obvious. MATLAB provides both interactive capabilities (apps and Live Editor tasks) and high-level functions that make it easy to try different methods and determine which is right for your data. Iterating through different configurations and selecting the optimal settings will help you prepare your data for further analysis.

Interactive Capabilities

The Data Cleaner app is a standalone interactive tool for preprocessing time-series data without writing code. Figure 6 shows how to import your data and then clean it, for example by filling in missing data and removing outliers. You can then save your modified data to the MATLAB workspace for further analysis. You can also automatically generate MATLAB code to document your steps and reproduce them later.

Figure 6. Using the Data Cleaner app in MATLAB to explore and clean time-series data.

Live Editor tasks are simple point-and-click interfaces that you can add directly to your script to perform a specific set of operations. These tasks can be configured interactively to iterate through different settings and identify the optimal configuration for your application. As with the Data Cleaner app, you can also automatically generate MATLAB code to reproduce your work.

You can interactively preprocess data using a sequence of Live Editor tasks, such as Clean Missing Data, Clean Outlier Data, and Normalize Data, visualizing the data at each step.

Figure 7. Data Preprocessing toolbar in MATLAB with a collection of Live Editor tasks.

Figure 8. Clean Outlier Data Live Editor task detecting outliers using median thresholding and filling them using linear interpolation.

Using MATLAB Functions

MATLAB provides thousands of high-level, built-in functions for common mathematical, scientific, and engineering calculations, including data preprocessing.

You can start exploring your raw data set by visualizing it in MATLAB. Figure 9 shows raw data containing missing values and outliers. The data set captures the solar irradiance received on a typical day. Harsh weather conditions can interfere with wireless telemetry transmission, resulting in a raw data set with imperfections.

Figure 9. Time-series plot of solar irradiance raw data with its missing values and outliers identified.

Here are five common data preprocessing techniques applied in MATLAB to the raw solar irradiance data set shown in Figure 9.

Addressing Outliers:

Anomalies in the telemetry data show up as outliers. Use filloutliers to detect the outliers and replace them. You can specify the method used to determine which values are outliers and the fill method used to estimate a replacement for each outlier.

Filling Missing Data:

Loss of communication results in missing data in telemetry. Use fillmissing to replace the missing values (such as NaN) in the data set with estimated values. You can specify an interpolation or moving-window technique to estimate each missing value.

Smoothing Data:

Use smoothdata to remove noise from the solar irradiance data. You can specify the smoothing method that is best suited to your data.

Normalizing Data:

Using the normalize function, you can easily see that more than 50% of the peak solar irradiance is received between 8 a.m. and 4 p.m. in this data set.

Grouping:

Use retime to group the solar irradiance data in 4-hour intervals to identify the mean solar irradiance in those time spans.
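
The five steps above can be chained as in the following sketch. It does not use the data from Figure 9: the irradiance curve, the dropout, and the spike are synthetic stand-ins, and the window lengths, the variable names, and the choice of linear interpolation are assumptions made for illustration.

```matlab
% Synthetic stand-in for one day of irradiance, sampled every 30 minutes
t   = datetime(2024, 6, 1) + minutes(0:30:1410)';
hrs = hours(timeofday(t));
irrRaw = max(0, 1000*sin(pi*(hrs - 6)/12));   % daylight from 6:00 to 18:00
irrRaw(20) = NaN;                             % simulated telemetry dropout
irrRaw(30) = 5000;                            % simulated telemetry spike

% Filling missing data by linear interpolation
irrClean = fillmissing(irrRaw, 'linear');

% Addressing outliers: detect with a 9-sample moving median,
% then fill by linear interpolation
irrClean = filloutliers(irrClean, 'linear', 'movmedian', 9);

% Smoothing with a 3-sample moving mean
irrClean = smoothdata(irrClean, 'movmean', 3);

% Normalizing to the range [0, 1], so values read as a fraction of the peak
irrNorm = normalize(irrClean, 'range');

% Grouping: mean over 4-hour intervals
TT  = timetable(t, irrClean, irrNorm, ...
    'VariableNames', {'Irradiance', 'Normalized'});
TT4 = retime(TT, 'regular', 'mean', 'TimeStep', hours(4));
```

A moving-window detection method is used here because the many zero-valued nighttime samples would distort a threshold computed over the whole day. The order of the steps is also a design choice; for example, filling outliers before smoothing keeps a spike from being smeared into its neighbors.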

Data can be messy, but data preprocessing techniques can help improve data quality and prepare your data for further analysis. See the resources below for more information.

See also: data cleaning, MATLAB for data analysis, MATLAB graphics