## Big Data with MATLAB

Big data refers to the dramatic increase in the amount and rate of data being created and made available for analysis.

A primary driver of this trend is the ever-increasing digitization of information. The number and variety of acquisition devices and other data generation mechanisms keep growing.

Big data sources include streaming data from instrumentation sensors, satellite and medical imagery, video from security cameras, as well as data derived from financial markets and retail operations. Big data sets from these sources can contain gigabytes or terabytes of data, and may grow on the order of megabytes or gigabytes per day.

Big data represents an opportunity for analysts and data scientists to gain greater insight and to make more informed decisions, but it also presents a number of challenges. Big data sets may not fit into available memory, may take too long to process, or may stream too quickly to store. Standard algorithms are usually not designed to process big data sets in reasonable amounts of time or memory. Because no single approach suits every big data problem, MATLAB provides a number of tools to tackle these challenges.

- **64-bit Computing.** The 64-bit version of MATLAB drastically increases the amount of data you can hold in memory, typically up to 2000 times more than any 32-bit program. While 32-bit programs limit you to addressing only 2 GB of memory, 64-bit MATLAB lets you address up to the physical memory limits of the OS. For Windows 8, that’s 500 GB for desktop versions and 4 TB for Windows Server.
- **Memory Mapped Variables.** The `memmapfile` function in MATLAB lets you map a file, or a portion of a file, to a MATLAB variable in memory. This allows you to efficiently access big data sets on disk that are too large to hold in memory or that take too long to load.
- **Disk Variables.** The `matfile` function lets you access MATLAB variables directly from MAT-files on disk, using MATLAB indexing commands, without loading the full variables into memory. This allows you to do block processing on big data sets that are otherwise too large to fit in memory.
- **Intrinsic Multicore Math.** Many of the built-in mathematical functions in MATLAB, such as `fft`, `inv`, and `eig`, are multithreaded. By running in parallel, these functions take full advantage of the multiple cores of your computer, providing high-performance computation on big data sets.
- **GPU Computing.** If you’re working with GPUs, GPU-optimized mathematical functions in Parallel Computing Toolbox provide even higher performance for big data sets.
- **Parallel Computing.** Parallel Computing Toolbox provides a parallel `for`-loop (`parfor`) that runs your MATLAB code and algorithms in parallel on multicore computers. If you use MATLAB Distributed Computing Server, you can execute in parallel on clusters of machines that can scale up to thousands of computers.
- **Cloud Computing.** You can run MATLAB computations in parallel using MATLAB Distributed Computing Server on Amazon Elastic Compute Cloud (EC2) for on-demand parallel processing on hundreds or thousands of computers. Cloud computing lets you process big data without having to buy or maintain your own cluster or data center.
- **Distributed Arrays.** Using Parallel Computing Toolbox and MATLAB Distributed Computing Server, you can work with matrices and multidimensional arrays that are distributed across the memory of a cluster of computers. Using this approach, you can store and perform computations on big data sets that are too large to fit in a single computer’s memory.
- **Streaming Algorithms.** Using System objects, you can perform stream processing on incoming streams of data that are too large or too fast to hold in memory. In addition, you can generate embedded C/C++ code from your MATLAB algorithms using MATLAB Coder, and run the resulting code on high-performance real-time systems.
- **Image Block Processing.** The `blockproc` function in Image Processing Toolbox lets you work with very large images by processing them efficiently a block at a time. Computations run in parallel on multiple cores and GPUs when used with Parallel Computing Toolbox.
- **Machine Learning.** Machine learning is helpful for extracting insights and developing predictive models with big data sets. A wide variety of machine learning algorithms, including boosted and bagged decision trees, K-means and hierarchical clustering, K-nearest neighbor search, Gaussian mixtures, the expectation maximization algorithm, hidden Markov models, and neural networks, are available in Statistics Toolbox and Neural Network Toolbox.
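As a rough illustration of the memory-mapping approach described above, the sketch below writes a small binary file and then maps it with `memmapfile`. The file name, data, and offsets are made up for the example; a real data set would be far larger, and its layout would determine the `Format`, `Offset`, and `Repeat` values.

```matlab
% Write a small binary file of doubles to stand in for a large data file.
fname = [tempname '.dat'];               % hypothetical temporary file
fid = fopen(fname, 'w');
fwrite(fid, (1:10000)', 'double');
fclose(fid);

% Map the whole file as a column vector of doubles; data is read on access.
mAll   = memmapfile(fname, 'Format', 'double');
first5 = mAll.Data(1:5);                 % touches only the start of the file
nVals  = numel(mAll.Data);

% Map only a portion: skip the first 100 values (8 bytes each), expose 50.
mPart = memmapfile(fname, 'Format', 'double', ...
                   'Offset', 100*8, 'Repeat', 50);
chunk = mPart.Data;                      % the 50 mapped values

% Release the maps before deleting the file.
clear mAll mPart
delete(fname);
```

Mapping a portion with `Offset` and `Repeat` keeps the mapped region small, which can matter when the file is larger than the available address space.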
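The block processing that `matfile` enables might look something like the following sketch, which computes column means of a variable stored on disk without loading it all at once. The variable name `X`, the block size, and the temporary file are assumptions chosen for illustration; the file is saved in Version 7.3 format, which is the format that supports partial reads.

```matlab
% Create a sample MAT-file (Version 7.3 format supports partial loading).
fname = [tempname '.mat'];               % hypothetical temporary file
X = rand(1000, 50);                      % stand-in for a much larger variable
save(fname, 'X', '-v7.3');

m = matfile(fname);                      % creates the object; X is not loaded
[nRows, nCols] = size(m, 'X');           % query the size without reading X

blockSize = 200;                         % rows per block (illustrative choice)
colSum = zeros(1, nCols);
for r = 1:blockSize:nRows
    rows   = r:min(r + blockSize - 1, nRows);
    block  = m.X(rows, :);               % read only these rows from disk
    colSum = colSum + sum(block, 1);     % accumulate a running total
end
colMean = colSum / nRows;

delete(fname);
```

Only one block is held in memory at a time, so the block size can be tuned to the memory available.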
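A parallel `for`-loop can be sketched as below. The task here (the largest eigenvalue magnitude of many random matrices) is a made-up workload chosen only because each iteration is independent; with Parallel Computing Toolbox the iterations are distributed across workers, and without it `parfor` should simply run as an ordinary loop.

```matlab
nTrials = 200;                           % number of independent iterations
maxEig  = zeros(1, nTrials);             % sliced output variable

parfor k = 1:nTrials
    A = rand(50);                        % independent work per iteration
    maxEig(k) = max(abs(eig(A)));        % each iteration writes its own slot
end

avgMaxEig = mean(maxEig);
```

The loop body must not depend on results from other iterations, which is what allows the iterations to run in any order on different workers.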
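For image block processing, a minimal sketch with `blockproc` is shown below, assuming Image Processing Toolbox is installed. The random test image and the 64-by-64 block size are illustrative choices; for images too large for memory, `blockproc` can also take a file name as its input instead of an in-memory array.

```matlab
% Synthetic grayscale image standing in for a much larger one.
I = uint8(randi(255, 512, 512));

% The function receives a block struct; the pixel values are in blk.data.
% Here each block is replaced by its own mean intensity.
fun = @(blk) mean2(blk.data) * ones(size(blk.data));

% Process the image one 64-by-64 block at a time.
J = blockproc(I, [64 64], fun);
```

With Parallel Computing Toolbox available, adding the `'UseParallel', true` option is one way to ask `blockproc` to spread blocks across workers.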

- Large Data Sets in MATLAB 47:40 (Webinar)
- In-memory Big Data Analysis with PCT and MDCS (Blog)
- Plot (Big) Function for Visualizing Large Data Sets (File Exchange)
- Using memmapfile to Navigate through “Big Data” Binary Files (Loren on the Art of MATLAB blog)
- Dealing with “Really Big” Images: Block Processing (*Steve on Image Processing* Blog)
- Stream Processing in MATLAB with System Objects (Overview)
- Using Distributed Arrays to Process Large Matrices 7:36 (Video)
- Solving Large-Scale Linear Algebra Problems Using SPMD and Distributed Arrays (Article)

- Tesco Uses Supply Chain Analytics to Save £100m a Year (Article and Related Video)
- Naturalistic Automobile Driving Data Processing and Analysis (Article and Related Video)
- Edwards Air Force Base Accelerates Analysis of Large Flight Test Data (User Story)
- Seismic Data Processing with High-Performance Computing (Overview)
- Modeling How Ecosystems Affect Regional Climates (Article)

- Strategies for Efficient Use of Memory in MATLAB (Documentation)
- Overview of Memory-Mapping in MATLAB (Documentation)
- Large Multi-Entry Text Files (FASTA, FASTQ, SAM) (Documentation)
- Distinct Block Processing in Image Processing Toolbox (Documentation)

*See also*: *HDF5 files*, *large data import (in Database Toolbox)*