Part of the power of data frame operations is that they typically do not perform any modifications (copying, etc.) of the underlying data. They only affect how iteration over that data is done, and what "view" of the data is presented. This allows data frame operations to be extremely performant, even when applied to very large (multi-gigabyte) data sets.
Note that the Arrow JS `Table` class inherits from the `DataFrame` class, which is why the examples in this section can apply `DataFrame` methods to `Table` instances.
Also, most data frame operations do not modify the original `Table` or `DataFrame`, but instead return a new, similar object with additional filtering or "iteration constraints" applied. So memory is usually not changed or modified during these operations.
The simplest way to remove rows from a data frame may be to use `Table.slice(start, end)`. As usual, rather than actually modifying memory, this operation returns a new `Table`/`DataFrame` with iteration constrained to a subset of the rows in the original frame.
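To illustrate the "view" semantics, here is a conceptual sketch in plain JavaScript (not the Arrow API, and `sliceView` is a made-up name): a sliced view keeps a reference to the original data plus iteration bounds, rather than copying rows.

```javascript
// Conceptual sketch: a slice is a view that constrains iteration,
// not a copy of the underlying data.
function sliceView(data, start, end) {
  return {
    *[Symbol.iterator]() {
      // Iterate only over [start, end); `data` itself is never copied
      for (let i = start; i < Math.min(end, data.length); i++) {
        yield data[i];
      }
    }
  };
}

const rows = ['a', 'b', 'c', 'd', 'e'];
const view = sliceView(rows, 1, 3);
console.log([...view]); // → ['b', 'c']
```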
The `Table.select(keys: String[])` method drops all columns except those whose names match the supplied `keys`.
table.select(['name', 'age']); // Drop all columns except name and age
Another way to "remove" rows from data frames is to apply filters. A filter effectively "removes" rows that don't fulfill its predicates. For details, see the note below.
const selectedName = 'myname';
// Remove all rows with name === 'myname'
const dataFrame = table.filter(arrow.predicate.col('name').eq(selectedName));
The predicate classes provided by Arrow allow for the comparison of column values against literals or JavaScript values (equality, greater than or equal, less than or equal), as well as the creation of composite logical expressions (`and`, `or` and `not`) out of individual column comparisons.
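The composition semantics can be sketched in plain JavaScript (this is not the Arrow API; `eq`, `gt`, `and` and `not` here are illustrative stand-ins): a predicate is just a row-to-boolean function, and logical combinators wrap other predicates.

```javascript
// Conceptual sketch: predicates as row -> boolean functions,
// combined with and/or/not wrappers.
const eq  = (col, value) => row => row[col] === value;
const gt  = (col, value) => row => row[col] > value;
const and = (p, q) => row => p(row) && q(row);
const not = p => row => !p(row);

const rows = [
  {name: 'ada', age: 36},
  {name: 'bob', age: 17},
  {name: 'ada', age: 12}
];

// Similar in spirit to col('name').eq('ada').and(col('age').gt(18))
const adults = rows.filter(and(eq('name', 'ada'), gt('age', 18)));
console.log(adults); // → [{name: 'ada', age: 36}]
```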
It is also possible to write custom predicates by supplying an arbitrary JavaScript function to filter rows; however, performance is usually best when using the built-in comparison predicates.
Note that calling `filter()` on a `DataFrame` doesn't actually remove any rows from the underlying data store (it just stores the predicates). It's not until you iterate over the data, e.g. by calling `countBy()` or `scan()`, that the filter is actually applied to the rows.
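This lazy behavior can be sketched in plain JavaScript (not the Arrow implementation; `makeFrame` is a made-up helper): `filter()` only records the predicate, and rows are tested later, during a scan.

```javascript
// Conceptual sketch of lazy filtering: filter() stores predicates,
// scan() applies them at iteration time.
function makeFrame(rows, predicates = []) {
  return {
    filter(pred) {
      // Returns a new frame; nothing is scanned or removed here
      return makeFrame(rows, [...predicates, pred]);
    },
    scan(next) {
      // Predicates are only applied now, while iterating
      for (const row of rows) {
        if (predicates.every(p => p(row))) next(row);
      }
    }
  };
}

const frame = makeFrame([{name: 'ada'}, {name: 'bob'}])
  .filter(row => row.name === 'ada');

const out = [];
frame.scan(row => out.push(row.name));
console.log(out); // → ['ada']
```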
To count the number of times different values appear in a table, use `countBy()`.
const newTable = table.countBy('column_name');
Note that `countBy()` does not return a modified data frame or table, but instead returns a new `Table` containing two columns, `value` and `count`. Each distinct value in the specified column of the original table is listed once in `value`, and the corresponding `count` field in the same row indicates how many times it was present in the original table.
Note that the results are not sorted.
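The shape of the result can be sketched in plain JavaScript (this is not the Arrow implementation, just a model of what `countBy('column_name')` produces):

```javascript
// Conceptual sketch: one result row per distinct value,
// with a `value` and a `count` field.
function countBy(rows, column) {
  const counts = new Map();
  for (const row of rows) {
    counts.set(row[column], (counts.get(row[column]) ?? 0) + 1);
  }
  // Note: insertion order, i.e. the results are not sorted
  return [...counts].map(([value, count]) => ({value, count}));
}

const rows = [{city: 'NYC'}, {city: 'SF'}, {city: 'NYC'}];
console.log(countBy(rows, 'city'));
// → [{value: 'NYC', count: 2}, {value: 'SF', count: 1}]
```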
DataFrames do not currently support sorting. To sort, you need to move the data back into JavaScript arrays.
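For example, once the rows are in a plain JavaScript array, sorting works as usual (here `rows` stands in for data already extracted from a table; the exact extraction call depends on your Arrow JS version):

```javascript
// Sort extracted rows in plain JavaScript, e.g. descending by count
const rows = [
  {value: 'SF', count: 1},
  {value: 'NYC', count: 2}
];
const sorted = [...rows].sort((a, b) => b.count - a.count);
console.log(sorted.map(r => r.value)); // → ['NYC', 'SF']
```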
The `DataFrame.scan()` method lets you define a custom function that will be called for each (non-filtered) record in the `DataFrame`.
Note: For simpler use cases, it is recommended to use the predicates provided by the Arrow API rather than writing a custom scan function, as performance will often be better.
The `next` callback for `scan()`
In order to be more efficient, Arrow data is broken up into batches of records (which is what makes it possible to do concatenations despite the columnar layout), and `DataFrame.scan()` does not hide this implementation detail from you.
`scan()` performance with `bind()` callbacks
In addition to the `next` callback, you can supply a `bind` function that `scan` calls each time it starts reading from a new `RecordBatch`. `scan` will call these functions as illustrated in the following pseudo-code:
for (const batch of batches) {
bind(batch);
for (const index in batch) {
next(index, batch);
}
}
Notes:
The `index` passed to `next` only applies to the current `RecordBatch`; it is not a global index.
The `RecordBatch` is passed to `next`, so it is possible to access data without writing a `bind` function, but there will be a performance penalty if your data has many batches.
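The pseudo-code above can be made concrete with a plain-JavaScript simulation (not the Arrow implementation; plain arrays stand in for `RecordBatch`es). The point of `bind` is to cache per-batch lookups once, instead of re-resolving them inside `next` for every row:

```javascript
// Conceptual sketch of the scan loop: bind() runs once per batch,
// next() runs once per row with a batch-local index.
function scan(batches, next, bind) {
  for (const batch of batches) {
    bind(batch);              // cache anything batch-specific here, once
    for (let index = 0; index < batch.length; index++) {
      next(index, batch);     // index restarts at 0 for every batch
    }
  }
}

const batches = [[1, 2], [3, 4, 5]]; // stand-ins for RecordBatches

let current;                  // set by bind, read by next
let sum = 0;
scan(
  batches,
  (index) => { sum += current[index]; },
  (batch) => { current = batch; }
);
console.log(sum); // → 15
```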