Introduction
Apache Arrow is a binary specification and set of libraries for representing Tables and Columns of strongly-typed fixed-width, variable-width, and nested data structures in-memory and over-the-wire.
Arrow represents columns of values in sets of contiguous buffers. This is in contrast to a row-oriented representation, where the values for each row are stored in a contiguous buffer. The columnar representation makes it easier to take advantage of SIMD instruction sets in modern CPUs and GPUs, and can lead to dramatic performance improvements processing large amounts of data.
Components
The Arrow library is organized into separate components responsible for creating, reading, writing, serializing, deserializing, or manipulating Tables or Columns.
- Data Types - Classes that define the fixed-width, variable-width, and composite data types Arrow can represent
- Vectors - Classes to read and decode JavaScript values from the underlying buffers or Vectors for each data type
- Builders - Classes to write and encode JavaScript values into the underlying buffers or Vectors for each data type
- Visitors - Classes to traverse, manipulate, read, write, or aggregate values from trees of Arrow Vectors or DataTypes
- IPC Readers and Writers - Classes to read and write the Arrow IPC (inter-process communication) binary file and stream formats
- Fields, Schemas, RecordBatches, Tables, and Columns - Classes to describe, manipulate, read, and write groups of strongly-typed Vectors or Columns
Concepts
it's probably good to define some terminology:
Dataa collection of rows in contiguous Arrow memory. This is called "Array" in most arrow implementations but is calledDatain Arrow JS to avoid shadowing the JSArraytype.Datacan have one or more underlying buffers but those buffers all represent the same data. E.g. integer storage like aDataof typeUint8has two buffers: one for the raw data (directly viewable by aUint8Array) and another for the nullability bitmask: one bit for each row to confer whether the row is null or not. Nested types can have more buffers. E.g. points can be represented as aDataof struct type, where there's a buffer for thexcoordinates and another buffer for theycoordinates.Vectora collection of rows in batches. This is essentially a list ofData.Field: metadata that describes an individualDataorVector. This containsname: string, data type,nullable: bool, andmetadata: Map<string, string>.Schema: metadata that describes a named collection ofDataorVector. This is essentiallyList<Field>, but it can also store optional associatedmetadata: Map<string, string>.RecordBatchan ordered and named collection ofDatainstances. This is essentially aList<Data>plus aSchema.Table: an ordered and named collection ofVectorinstances. This is essentially aList<Vector>plus aSchema.
Data Types
At the heart of Arrow is set of well-known logical data types, ensuring each Column in an Arrow Table is strongly-typed. These data types define how a Column's underlying buffers should be constructed and read, and includes configurable (and custom) metadata fields for further annotating a Column. A Schema describing each Column's name and data type is encoded alongside each Column's data buffers, allowing you to consume an Arrow data source without knowing the data types or column layout beforehand.
Each data type falls into one of three rough categories: Fixed-width types, variable-width types, or composite types that contain other Arrow data types. All data types can represent null values, which are stored in a separate validity bitmask. Follow the links below for a more detailed description of each data type.