RecordBatchReader
This documentation reflects Arrow JS v4.0. It needs to be updated for the new Arrow API in v9.0+.
The RecordBatchReader is the IPC reader for reading chunks from a stream or file.
Usage
The JavaScript API supports streaming multiple Arrow tables over a single socket.
To read all batches from all tables in a data source:
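import {RecordBatchReader} from 'apache-arrow';

const readers = RecordBatchReader.readAll(fetch(path, {credentials: 'omit'}));
// Outer loop: one reader per table; inner loop: that table's record batches
for await (const reader of readers) {
  for await (const batch of reader) {
    console.log(batch.length);
  }
}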
If you only have one table (the normal case), there will be only one RecordBatchReader, and the outer loop will execute only once. You can also create a single reader via:
const reader = await RecordBatchReader.from(fetch(path, {credentials: 'omit'}));
Methods
readAll() : AsyncIterable<RecordBatchReader>
Reads all batches from all tables in the data source.
from(data : *) : RecordBatchFileReader | RecordBatchStreamReader
data
- Array
- fetch response object
- stream
The RecordBatchReader.from method also detects which physical representation it's working with (Streaming or File) and returns either a RecordBatchFileReader or a RecordBatchStreamReader accordingly.
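To make the idea of detecting the physical representation concrete, here is a minimal sketch of how such a check could work. It relies on the fact that the Arrow File (random-access) format begins with the magic string "ARROW1", while the Streaming format does not. The helper name guessArrowFormat is hypothetical and is not part of the Arrow JS API; this is not necessarily how RecordBatchReader.from is implemented.

```javascript
// The Arrow File (random-access) format starts with the ASCII bytes "ARROW1".
const ARROW_MAGIC = [0x41, 0x52, 0x52, 0x4f, 0x57, 0x31];

// Hypothetical helper (not an Arrow JS function): returns 'file' when the
// buffer starts with the magic string, otherwise 'stream'. It only inspects
// the first six bytes and does not validate the rest of the payload.
function guessArrowFormat(bytes) {
  if (bytes.length < ARROW_MAGIC.length) return 'stream';
  for (let i = 0; i < ARROW_MAGIC.length; i++) {
    if (bytes[i] !== ARROW_MAGIC[i]) return 'stream';
  }
  return 'file';
}
```

In practice you would let RecordBatchReader.from make this decision; the sketch only illustrates why a single entry point can serve both representations.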
Remarks:
- If you're fetching the table from a Node server, make sure the content-type is application/octet-stream.
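As an illustration of the content-type remark, the following sketch shows a request handler that could be passed to Node's built-in http server. The function name serveArrow and the port in the usage comment are hypothetical; only writeHead() and end() from Node's response object are used.

```javascript
// Hypothetical helper: builds a request handler that serves an in-memory
// Arrow payload (a Uint8Array or Buffer) with the expected content type.
function serveArrow(arrowBytes) {
  return function handler(_req, res) {
    res.writeHead(200, {
      'Content-Type': 'application/octet-stream',
      'Content-Length': arrowBytes.byteLength,
    });
    res.end(arrowBytes);
  };
}

// Usage in Node (port chosen arbitrarily for this sketch):
//   require('http').createServer(serveArrow(bytes)).listen(8080);
```

The client side can then pass fetch(url) straight to RecordBatchReader.from as shown in the Usage section.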
toNodeStream()
pipe()
You can also turn the RecordBatchReader into a stream. If you're in Node, you can use either toNodeStream() or the pipe(writable) method.
In the browser (assuming you're using the UMD or "browser" fields in webpack), you can call toDOMStream() or pipeTo(writable)/pipeThrough(transform).
throughNode()
throughDOM()
You can also create a transform stream directly, instead of using RecordBatchReader.from(), via throughNode() (in Node) and throughDOM() (in the browser), respectively:
- https://github.com/apache/arrow/blob/49b4d2aad50e9d18cb0a51beb3a2aaff1b43e168/js/test/unit/ipc/reader/streams-node-tests.ts#L54
- https://github.com/apache/arrow/blob/49b4d2aad50e9d18cb0a51beb3a2aaff1b43e168/js/test/unit/ipc/reader/streams-dom-tests.ts#L50
By default, the transform streams read only one table from the source readable stream and then close, but you can change this behavior by passing { autoDestroy: false } to the transform creation methods.
Remarks
- Reading from multiple tables (readAll()) is technically an extension in the JavaScript Arrow API compared to the Arrow C++ API. The authors found it useful to be able to send multiple tables over the same physical socket, so they built the ability to keep the underlying socket open and read more than one table from a stream.
- Note that Arrow has two physical representations, one for streaming and another for random access, so this only applies to the streaming representation.
- The IPC protocol is that a stream of ordered Messages is consumed atomically. Messages can be of type Schema, DictionaryBatch, RecordBatch, or Tensor (which we don't support yet). The Streaming format is just a sequence of messages with Schema first, then n DictionaryBatches, then m RecordBatches.
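The message ordering described in the last remark can be sketched as a small check over a list of message type names. This models only the simplified description given here, for a single table: one Schema, then DictionaryBatches, then RecordBatches. The function name checkStreamOrder is hypothetical and not part of Arrow JS, and real streams may be more permissive (for example, dictionary batches interleaved with record batches).

```javascript
// Hypothetical helper: returns true when the message types follow the
// simplified ordering Schema, DictionaryBatch*, RecordBatch* for one table.
function checkStreamOrder(messageTypes) {
  if (messageTypes.length === 0 || messageTypes[0] !== 'Schema') return false;
  let seenRecordBatch = false;
  for (const type of messageTypes.slice(1)) {
    if (type === 'DictionaryBatch') {
      // In this simplified model, dictionaries come before any record batch.
      if (seenRecordBatch) return false;
    } else if (type === 'RecordBatch') {
      seenRecordBatch = true;
    } else {
      // A second Schema, a Tensor, or an unknown message type.
      return false;
    }
  }
  return true;
}
```

A stream carrying several tables, as read by readAll(), would repeat this pattern once per table, with each new Schema message marking the start of the next table.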