Data Types

Arrow supports a rich set of data types:

  • Fixed-length primitive types: numbers, booleans, dates and times, fixed-size binary, decimals, and other values that fit into a given number of bits
  • Variable-length primitive types: binary, string
  • Nested types: list, struct, and union
  • Dictionary type: An encoded categorical type

Data Type Descriptor Objects

Converting Dates

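An Apache Arrow Timestamp is a signed 64-bit int counting seconds, milliseconds, microseconds, or nanoseconds since the UNIX epoch, depending on the column's unit. In JS it is represented as two 32-bit ints to preserve precision. The first number is the "low" int and the second number is the "high" int. The function below assumes microsecond units, so it divides by 1000 to get the milliseconds that Date expects.

function toDate(timestamp) {
  // timestamp = [low, high]; `>>> 0` reads the low word as unsigned
  const micros = timestamp[1] * Math.pow(2, 32) + (timestamp[0] >>> 0);
  return new Date(micros / 1000); // microseconds to milliseconds
}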
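
As a quick check of the low/high layout, the sketch below splits a known 64-bit value into two 32-bit words with BigInt and rebuilds a Date from them. The helper names `splitInt64` and `toDateFromMicros` are hypothetical, and microsecond units are an assumption; a column with a different unit would need a different divisor.

```javascript
// Hypothetical helper: split a 64-bit BigInt into [low, high] signed 32-bit words,
// the same layout the text describes for Arrow timestamps in JS.
function splitInt64(value) {
  const low = Number(BigInt.asIntN(32, value));         // lowest 32 bits, signed
  const high = Number(BigInt.asIntN(32, value >> 32n)); // upper 32 bits, signed
  return Int32Array.of(low, high);
}

// Hypothetical helper: rebuild a Date, assuming the words hold microseconds.
function toDateFromMicros(words) {
  // `>>> 0` reinterprets the low word as unsigned before recombining.
  const micros = words[1] * Math.pow(2, 32) + (words[0] >>> 0);
  return new Date(micros / 1000);
}

const sampleMicros = 1609459200000000n; // 2021-01-01T00:00:00Z in microseconds
const sampleWords = splitInt64(sampleMicros);
console.log(toDateFromMicros(sampleWords).toISOString()); // 2021-01-01T00:00:00.000Z
```

The unsigned reinterpretation of the low word matters whenever its top bit is set; without it the low word would be read as negative and the result would be off by 2^32 units.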

Data Types Reference

At the heart of Arrow is a set of well-known logical data types, ensuring each Column in an Arrow Table is strongly typed. These data types define how a Column's underlying buffers should be constructed and read, and include configurable (and custom) metadata fields for further annotating a Column. A Schema describing each Column's name and data type is encoded alongside each Column's data buffers, allowing you to consume an Arrow data source without knowing the data types or column layout beforehand.

Each data type falls into one of three rough categories: fixed-width types, variable-width types, or composite types that contain other Arrow data types. All data types can represent null values, which are stored in a separate validity bitmask. The sections below give a short description of each data type.
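
To make the validity bitmask concrete, here is a minimal sketch that checks whether a slot is null. It assumes the bitmask is a bit-packed, LSB-ordered Uint8Array in which a set bit marks a valid (non-null) value. The `isValid` helper and the sample data are illustrative and not part of the Arrow API.

```javascript
// Illustrative helper: true when bit `index` of the bitmask is set.
function isValid(bitmask, index) {
  const byte = bitmask[index >> 3];            // 8 slots per byte
  return (byte & (1 << (index & 7))) !== 0;    // LSB-first within each byte
}

// Sample column: five Int32 slots, of which slots 1 and 4 are null.
const values = Int32Array.of(10, 0, 30, 40, 0);
const validity = Uint8Array.of(0b00001101);    // bits 0, 2, 3 set

// Materialize the column, substituting null where the bit is clear.
const decoded = Array.from(values, (v, i) => (isValid(validity, i) ? v : null));
console.log(decoded); // [ 10, null, 30, 40, null ]
```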

Fixed-width Data Types

Fixed-width data types describe physical primitive values (bytes or bits of some fixed size), or logical values that can be represented as primitive values. In addition to an optional Uint8Array validity bitmask, these data types have a physical data buffer (a TypedArray corresponding to the data type's physical element width).

  • Null - A column of NULL values having no physical storage
  • Bool - Booleans as either 0 or 1 (bit-packed, LSB-ordered)
  • Int - Signed or unsigned 8, 16, 32, or 64-bit little-endian integers
  • Float - 2, 4, or 8-byte floating point values
  • Decimal - Precision-and-scale-based 128-bit decimal values
  • FixedSizeBinary - A list of fixed-size binary sequences, where each value occupies the same number of bytes
  • Date - Date as signed 32-bit integer days or 64-bit integer milliseconds since the UNIX epoch
  • Time - Time as signed 32 or 64-bit integers, representing either seconds, milliseconds, microseconds, or nanoseconds since midnight (00:00:00)
  • Timestamp - Exact timestamp as signed 64-bit integers, representing either seconds, milliseconds, microseconds, or nanoseconds since the UNIX epoch
  • Interval - Time intervals as pairs of either (year, month) or (day, time) in SQL style
  • FixedSizeList - Fixed-size sequences of another logical Arrow data type
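
As an illustration of how the logical date and time types map onto plain integers, the sketch below converts a Date stored as days since the UNIX epoch, and a Time stored as seconds since midnight, into readable values. The helper names `dateFromDays` and `timeFromSeconds` are hypothetical, and the choice of units is an assumption made for the example.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: a Date value stored as 32-bit integer days since the epoch.
function dateFromDays(days) {
  return new Date(days * MS_PER_DAY);
}

// Hypothetical helper: a Time value stored as 32-bit integer seconds since midnight.
function timeFromSeconds(seconds) {
  const pad = (n) => String(n).padStart(2, "0");
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = seconds % 60;
  return `${pad(h)}:${pad(m)}:${pad(s)}`;
}

const days = Int32Array.of(0, 18628);      // physical buffer of a Date column
const times = Int32Array.of(0, 3723);      // physical buffer of a Time column

console.log(dateFromDays(days[1]).toISOString().slice(0, 10)); // 2021-01-01
console.log(timeFromSeconds(times[1]));                        // 01:02:03
```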

Variable-width Data Types

Variable-width types describe lists of values with different widths, including binary blobs, Utf8 code-points, or slices of another underlying Arrow data type. These types store the values contiguously in memory, and have a physical Int32Array of offsets that describe the start and end indices of each list element.

  • List - Variable-length sequences of another logical Arrow data type
  • Utf8 - Variable-length byte sequences of UTF8 code-points (strings)
  • Binary - Variable-length byte sequences (no guarantee of UTF8-ness)
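
The offsets scheme described above can be sketched with plain typed arrays. This example packs a few strings into one contiguous Uint8Array plus an Int32Array of offsets, then reads an element back by slicing between two adjacent offsets. The `getString` helper and the variable names are illustrative and not part of the Arrow API.

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

const strings = ["arrow", "", "js"];
const chunks = strings.map((s) => encoder.encode(s));

// offsets[i] is the start of element i; offsets[i + 1] is its end.
const offsets = new Int32Array(strings.length + 1);
chunks.forEach((chunk, i) => {
  offsets[i + 1] = offsets[i] + chunk.length;
});

// All values stored back to back in a single buffer.
const data = new Uint8Array(offsets[strings.length]);
chunks.forEach((chunk, i) => data.set(chunk, offsets[i]));

// Illustrative helper: decode element i without copying the underlying bytes.
function getString(data, offsets, i) {
  return decoder.decode(data.subarray(offsets[i], offsets[i + 1]));
}

console.log(Array.from(offsets));         // [ 0, 5, 5, 7 ]
console.log(getString(data, offsets, 2)); // js
```

An empty element simply has equal start and end offsets, so it takes no space in the data buffer.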

Composite Data Types

Composite types don't have physical data buffers of their own. They contain other Arrow data types and delegate work to them.

  • Union - Union of logical child data types
  • Map - Map of named logical child data types
  • Struct - Struct of ordered logical child data types
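
The idea of delegating to child types can be illustrated with a toy Struct that owns no values itself and builds each row by asking its children. This is a conceptual sketch only: `makeColumn` and `makeStruct` are hypothetical helpers and do not reflect how the Arrow JS library implements Struct.

```javascript
// Hypothetical child column: wraps any indexable store of values.
function makeColumn(values) {
  return { length: values.length, get: (i) => values[i] };
}

// Hypothetical struct: holds only named children and delegates reads to them.
function makeStruct(children) {
  const names = Object.keys(children);
  return {
    length: children[names[0]].length,
    get(i) {
      const row = {};
      for (const name of names) {
        row[name] = children[name].get(i); // the child does the actual lookup
      }
      return row;
    },
  };
}

const people = makeStruct({
  id: makeColumn(Int32Array.of(1, 2)),
  name: makeColumn(["ada", "lin"]),
});

console.log(people.get(1)); // { id: 2, name: 'lin' }
```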