Parquet is a columnar format developed within the Apache project. Data is compressed on disk and read into memory before use.
At present, only very limited metadata is written. Parquet does not appear to have any standard format for per-column metadata, so the only information written about each column, apart from its datatype, is its name.
Parquet support is currently somewhat experimental.
Note:
The parquet I/O handlers require large external libraries, which are not always bundled with the library/application software because of their size. In some configurations, parquet support may not be present, and attempts to read or write parquet files will result in a message like:
Parquet-mr libraries not available
If you can supply the relevant libraries on the classpath at runtime, the parquet support will work. At time of writing, the required libraries are included in the topcat-extra.jar monolithic jar file; they can also be found in the starjava github repository (https://github.com/Starlink/starjava/tree/master/parquet/src/lib, use parquet-mr-stil.jar and its dependencies), or you can acquire them from the Parquet MR package. These arrangements may be revised in future releases, for instance if parquet usage becomes more mainstream. The required dependencies are those of the Parquet MR submodule parquet-cli, in particular the files:
parquet-cli-1.11.1.jar
parquet-column-1.11.1.jar
parquet-common-1.11.1.jar
parquet-encoding-1.11.1.jar
parquet-format-structures-1.11.1.jar
parquet-hadoop-1.11.1-noshaded.jar
parquet-jackson-1.11.1.jar
commons-collections-3.2.2.jar
commons-configuration-1.6.jar
commons-lang-2.6.jar
failureaccess-1.0.1.jar
guava-27.0.1-jre.jar
hadoop-auth-2.7.3.jar
hadoop-common-2.7.3.jar
log4j-1.2.17.jar
slf4j-api-1.7.22.jar
slf4j-log4j12-1.7.22.jar
snappy-java-1.1.7.3.jar
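As a rough illustration of what "available on the classpath at runtime" means, the following sketch probes for a single class before any parquet I/O is attempted. The class name ParquetProbe and the method isClassAvailable are hypothetical helpers invented for this example, not part of this library, and the choice of org.apache.parquet.hadoop.ParquetWriter (a class from the parquet-hadoop jar) as the marker is an assumption; the handlers themselves may test for availability in some other way.

```java
// Sketch only: checks whether a class can be located on the classpath.
class ParquetProbe {

    // Assumed marker class, shipped in the parquet-hadoop jar.
    static final String MARKER_CLASS = "org.apache.parquet.hadoop.ParquetWriter";

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ParquetProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class is absent or one of its dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isClassAvailable(MARKER_CLASS)
            ? "Parquet MR classes found"
            : "Parquet MR classes not found on classpath");
    }
}
```

Running this with and without the jars listed above on the classpath should show whether they are being picked up.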
The handler behaviour may be modified by specifying one or more comma-separated name=value configuration options in parentheses after the handler name, e.g. "parquet(groupArray=false)".
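To make the option syntax concrete, here is a minimal sketch that splits such a specification into a handler name and a map of options. HandlerSpec and its parse method are hypothetical names made up for this illustration; the library's own parsing code is not shown here and may handle quoting or whitespace differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: parses strings of the form name(opt1=val1,opt2=val2).
class HandlerSpec {
    final String name;
    final Map<String, String> options;

    HandlerSpec(String name, Map<String, String> options) {
        this.name = name;
        this.options = options;
    }

    static HandlerSpec parse(String spec) {
        String s = spec.trim();
        int open = s.indexOf('(');
        if (open < 0) {
            // No parenthesized part: just a handler name with no options.
            return new HandlerSpec(s, new LinkedHashMap<>());
        }
        if (!s.endsWith(")")) {
            throw new IllegalArgumentException("Missing closing parenthesis: " + spec);
        }
        String name = s.substring(0, open).trim();
        String body = s.substring(open + 1, s.length() - 1);
        Map<String, String> opts = new LinkedHashMap<>();
        if (!body.trim().isEmpty()) {
            for (String pair : body.split(",")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    throw new IllegalArgumentException("Expected name=value: " + pair);
                }
                opts.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return new HandlerSpec(name, opts);
    }

    public static void main(String[] args) {
        HandlerSpec hs = parse("parquet(groupArray=false)");
        System.out.println(hs.name + " " + hs.options);
    }
}
```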
The following options are available:
groupArray = true|false
Controls how array-valued columns are written. For example, for an array-valued int32 column named IVAL, groupArray=false will write it as "repeated int32 IVAL", while groupArray=true will write it as "optional group IVAL (LIST) { repeated group list { optional int32 item} }". I don't know why you'd want to do it the latter way, but some other parquet writers seem to do that by default, so there must be some good reason.
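The difference between the two layouts can be seen side by side with a small sketch that builds the schema text for an array column. ArraySchemaSketch and arrayColumnSchema are hypothetical names used only for illustration; the sketch reproduces the two strings quoted above and does not call any Parquet library code.

```java
// Sketch only: builds the schema text for an array column in either layout.
class ArraySchemaSketch {

    static String arrayColumnSchema(String name, String elementType, boolean groupArray) {
        if (groupArray) {
            // Nested layout: an outer LIST group wrapping a repeated group.
            return "optional group " + name + " (LIST) { repeated group list { optional "
                 + elementType + " item} }";
        }
        // Flat layout: a single repeated primitive field.
        return "repeated " + elementType + " " + name;
    }

    public static void main(String[] args) {
        System.out.println(arrayColumnSchema("IVAL", "int32", false));
        System.out.println(arrayColumnSchema("IVAL", "int32", true));
    }
}
```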
If no output format is explicitly chosen, writing to a filename with the extension ".parquet" or ".parq" (case insensitive) will select parquet format for output.
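The extension rule can be sketched as a simple case-insensitive suffix test. FormatByExtension and looksLikeParquet are hypothetical names for this example; the library's actual format selection logic is not shown here and may differ in detail.

```java
import java.util.Locale;

// Sketch only: case-insensitive check for the ".parquet" and ".parq" extensions.
class FormatByExtension {

    static boolean looksLikeParquet(String filename) {
        // Locale.ROOT avoids locale-specific case mapping surprises.
        String lower = filename.toLowerCase(Locale.ROOT);
        return lower.endsWith(".parquet") || lower.endsWith(".parq");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeParquet("catalogue.PARQUET"));
        System.out.println(looksLikeParquet("catalogue.fits"));
    }
}
```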