33 Introduction
The data.table package provides a modern and highly optimized version of R’s data.frame structure. It is highly memory efficient and can automatically parallelize internal operations to achieve substantial speed improvements over data.frames.
Advantages of data.table include:
- Fast and efficient reading, writing, and handling of big datasets
  - fast read & write of delimited files with fread() and fwrite()
  - in-place operations without creating unnecessary copies of data
- Concise and flexible syntax for data manipulation, great for handling small or big data
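As a rough illustration of the points above, the following sketch writes a small table with fwrite(), reads it back with fread(), and adds a column by reference with :=. The table, its column names (id, sbp), and the threshold of 140 are made up for this example; address() is used only to check whether the table was copied.

```r
library(data.table)

# Hypothetical example data: five patients with systolic blood pressure
dt <- data.table(id = 1:5, sbp = c(118, 135, 142, 127, 150))

# Write to a temporary CSV file and read it back
path <- tempfile(fileext = ".csv")
fwrite(dt, path)
dt2 <- fread(path)

# Add a column by reference: dt2 is modified in place, no assignment needed
addr_before <- address(dt2)
dt2[, hypertensive := sbp >= 140]

# Compare the memory address of dt2 before and after the update
same_address <- identical(addr_before, address(dt2))
same_address

dt2
```

If the update was made in place, same_address should be TRUE, since := adds the new column without copying the existing table.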
In health data science, it is common to handle very large datasets, especially when working with electronic health record (EHR) data. In such cases, we often have to read, clean, reshape, transform, and merge multiple tables of different dimensions, often featuring many millions of rows and thousands of columns. The benefits of data.table become immediately apparent in such scenarios.
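A minimal sketch of this kind of workflow, using two tiny made-up tables in place of real EHR data, might look like the following. The table names (patients, visits) and columns are hypothetical; the same calls apply unchanged to much larger tables.

```r
library(data.table)

# Hypothetical patient-level and visit-level tables
patients <- data.table(
  patient_id = 1:3,
  sex = c("F", "M", "F")
)
visits <- data.table(
  patient_id = c(1L, 1L, 2L, 3L, 3L, 3L),
  sbp = c(120, 130, 145, 110, 115, 125)
)

# Merge the two tables on their shared identifier
merged <- merge(visits, patients, by = "patient_id")

# Summarize per patient: number of visits and mean systolic blood pressure
summary_dt <- merged[
  ,
  .(n_visits = .N, mean_sbp = mean(sbp)),
  by = .(patient_id, sex)
]
summary_dt
```

The result has one row per patient, with the visit count and mean computed inside a single data.table expression.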
33.1 Installation
To install from CRAN:
install.packages("data.table")
data.table includes a built-in command to update to the latest development version. Recent releases name it update_dev_pkg(); older releases used update.dev.pkg():
data.table::update_dev_pkg()
33.2 Note on OpenMP support
data.table automatically parallelizes operations behind the scenes when possible, using the OpenMP library. The compiler that ships with macOS (Apple clang) has OpenMP support disabled by default.
If data.table is installed without OpenMP support, loading it with library(data.table) prints a message to the console saying that it is running on a single thread. You can still use data.table without OpenMP support; operations simply run single-threaded.
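Besides the startup message, you can query the thread count directly. The sketch below uses getDTthreads() and setDTthreads() from data.table; the variable names are illustrative only.

```r
library(data.table)

# Number of threads data.table is currently set to use
n_threads <- getDTthreads()
n_threads

# Print a more detailed report, including OpenMP-related settings
getDTthreads(verbose = TRUE)

# Restrict data.table to one thread, then restore the previous setting
old_threads <- setDTthreads(1)
single <- getDTthreads()
setDTthreads(old_threads)
```

Without OpenMP support, n_threads would be expected to be 1 regardless of how many cores the machine has.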
The data.table installation wiki describes how to enable OpenMP support in the macOS compiler. The recommended option is to download the libraries from the mac.r-project site and copy them to the /usr/local/lib and /usr/local/include directories as appropriate.
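After copying the files, a quick check from R can confirm they are where the compiler will look for them. The file names below (libomp.dylib and omp.h) are an assumption based on the usual contents of the OpenMP runtime download and may differ between releases; only the two directories come from the instructions above.

```r
# Assumed file locations; the exact file names may vary by OpenMP release
omp_files <- c(
  library = "/usr/local/lib/libomp.dylib",
  header  = "/usr/local/include/omp.h"
)

# TRUE for each file that is present, FALSE otherwise
found <- file.exists(omp_files)
names(found) <- names(omp_files)
found
```

If either entry is FALSE, the corresponding file is not in the expected location, and compiling data.table with OpenMP support will likely fail.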
After adding OpenMP support, you can compile the latest version of data.table:
pak::pak("Rdatatable/data.table")
If everything works correctly, loading the library will now report how many threads are being used.