The usage of tmatch2 is

   stilts <stilts-flags> tmatch2 ifmt1=<in-format> ifmt2=<in-format>
                                 icmd1=<cmds> icmd2=<cmds> ocmd=<cmds>
                                 omode=out|meta|stats|count|checksum|cgi|discard|topcat|samp|plastic|tosql|gui
                                 out=<out-table> ofmt=<out-format>
                                 matcher=<matcher-name> values1=<expr-list>
                                 values2=<expr-list> params=<match-params>
                                 tuning=<tuning-params>
                                 join=1and2|1or2|all1|all2|1not2|2not1|1xor2
                                 find=all|best|best1|best2 fixcols=none|dups|all
                                 suffix1=<label> suffix2=<label>
                                 scorecol=<col-name> progress=none|log|time|profile
                                 runner=parallel|parallel<n>|parallel-all|sequential|classic|partest
                                 [in1=]<table1> [in2=]<table2>

If you don't have the stilts script installed, write "java -jar stilts.jar" instead of "stilts" - see Section 3. The available <stilts-flags> are listed in Section 2.1.

For programmatic invocation, the Task class for this command is uk.ac.starlink.ttools.task.TableMatch2.
Parameter values are assigned on the command line as explained in Section 2.3. They are as follows:
find = all|best|best1|best2
(PairMode)
Determines what happens when a row in one table can be matched by more than one row in the other table. The options are:
   all: All matches. Every match between the two tables is included in the result. Rows from both of the input tables may appear multiple times in the result.
   best: Best match, symmetric. The best pairs are selected in a way which treats the two tables symmetrically. Any input row which appears in one result pair is disqualified from appearing in any other result pair, so each row from both input tables will appear in at most one row in the result.
   best1: Best match for each Table 1 row. For each row in table 1, only the best match from table 2 will appear in the result. Each row from table 1 will appear a maximum of once in the result, but rows from table 2 may appear multiple times.
   best2: Best match for each Table 2 row. For each row in table 2, only the best match from table 1 will appear in the result. Each row from table 2 will appear a maximum of once in the result, but rows from table 1 may appear multiple times.
The differences between best, best1 and best2 are a bit subtle.
In cases where it's obvious which object in each table is the best match for which object in the other, choosing between these options will not affect the result. However, in crowded fields (where the distance between objects within one or both tables is typically similar to or smaller than the specified match radius) it will make a difference. In this case one of the asymmetric options (best1 or best2) is usually more appropriate than best, but you'll have to think about which of them suits your requirements. The performance (time and memory usage) of the match may also differ between these options, especially if one table is much bigger than the other.
[Default: best]
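The effect of the find options in a crowded field can be illustrated with a toy sketch. This is not the STILTS implementation: the function name select_pairs, the (row1, row2, score) tuples and the greedy best-score-first selection used for the symmetric case are all assumptions made for illustration, with a lower score taken to mean a better match.

```python
def select_pairs(pairs, find="best"):
    """Filter candidate (row1, row2, score) triples; lower score is better.

    Toy model only: greedy selection in order of increasing score is an
    assumption, not a description of the real matching code.
    """
    ranked = sorted(pairs, key=lambda p: p[2])
    if find == "all":
        return ranked
    used1, used2, result = set(), set(), []
    for i1, i2, score in ranked:
        # best and best1 allow each table 1 row at most once
        if find in ("best", "best1") and i1 in used1:
            continue
        # best and best2 allow each table 2 row at most once
        if find in ("best", "best2") and i2 in used2:
            continue
        used1.add(i1)
        used2.add(i2)
        result.append((i1, i2, score))
    return result


# Crowded field: every row lies within the match radius of every other row.
candidates = [(0, 0, 0.2), (0, 1, 0.3), (1, 0, 0.5), (1, 1, 0.9)]

for mode in ("all", "best", "best1", "best2"):
    print(mode, select_pairs(candidates, mode))
```

With these hypothetical candidates the three best options each keep a different pair of rows, which is the situation the text warns about.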
fixcols = none|dups|all
(Fixer)
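Determines how input columns are renamed before use in the output table. The choices are:
   none: columns are not renamed
   dups: columns which would otherwise have duplicate names in the output will be renamed to indicate which table they came from
   all: all columns will be renamed to indicate which table they came from
If columns are renamed, the new names are determined by the suffix* parameters.
[Default: dups]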
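As a rough illustration of the three fixcols choices, the following sketch renames column names from two tables using the default suffixes. The function fix_columns is hypothetical, and it compares names exactly; how STILTS itself decides that two names are duplicates is not stated here.

```python
def fix_columns(cols1, cols2, fixcols="dups", suffix1="_1", suffix2="_2"):
    """Return output column names for two tables joined side by side.

    Simplified sketch: duplicate detection is by exact name comparison,
    which is an assumption for illustration.
    """
    if fixcols == "none":
        return list(cols1) + list(cols2)
    if fixcols == "all":
        return [c + suffix1 for c in cols1] + [c + suffix2 for c in cols2]
    if fixcols == "dups":
        shared = set(cols1) & set(cols2)
        out1 = [c + suffix1 if c in shared else c for c in cols1]
        out2 = [c + suffix2 if c in shared else c for c in cols2]
        return out1 + out2
    raise ValueError(f"unknown fixcols value: {fixcols}")


# Hypothetical column names
print(fix_columns(["RA", "DEC", "MAG"], ["RA", "DEC", "FLUX"]))
```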
icmd1 = <cmds>
(ProcessingStep[])
Specifies processing to be performed on the first input table as specified by parameter in1, before any other processing has taken place. The value of this parameter is one or more of the filter commands described in Section 6.1. If more than one is given, they must be separated by semicolon characters (";"). This parameter can be repeated multiple times on the same command line to build up a list of processing steps. The sequence of commands given in this way defines the processing pipeline which is performed on the table.

Commands may alternatively be supplied in an external file, by using the indirection character '@'. Thus a value of "@filename" causes the file filename to be read for a list of filter commands to execute. The commands in the file may be separated by newline characters and/or semicolons, and lines which are blank or which start with a '#' character are ignored. A backslash character '\' at the end of a line joins it with the following line.
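The command-file rules just described can be sketched as a small parser. This is an illustration of the stated rules only, not the code STILTS uses: the function name read_filter_commands is hypothetical, semicolons inside quoted expressions are not handled, and treating indented '#' lines as comments is an assumption.

```python
def read_filter_commands(text):
    """Split the contents of a command file into individual filter commands."""
    # A backslash at the end of a line joins it with the following line.
    logical_lines, pending = [], ""
    for line in text.splitlines():
        if line.endswith("\\"):
            pending += line[:-1]
        else:
            logical_lines.append(pending + line)
            pending = ""
    if pending:
        logical_lines.append(pending)

    commands = []
    for line in logical_lines:
        stripped = line.strip()
        # Blank lines and lines starting with '#' are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        # Commands on one line may be separated by semicolons.
        commands.extend(c.strip() for c in stripped.split(";") if c.strip())
    return commands


# Hypothetical file contents; the column names are made up.
sample = "\n".join([
    "# keep bright sources only",
    'select "MAG < 20"; sort RA',
    "",
    "addcol COLOUR \\",
    '"B - V"',
])
print(read_filter_commands(sample))
```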
icmd2 = <cmds>
(ProcessingStep[])
Specifies processing to be performed on the second input table as specified by parameter in2, before any other processing has taken place. The value of this parameter is one or more of the filter commands described in Section 6.1. If more than one is given, they must be separated by semicolon characters (";"). This parameter can be repeated multiple times on the same command line to build up a list of processing steps. The sequence of commands given in this way defines the processing pipeline which is performed on the table.

Commands may alternatively be supplied in an external file, by using the indirection character '@'. Thus a value of "@filename" causes the file filename to be read for a list of filter commands to execute. The commands in the file may be separated by newline characters and/or semicolons, and lines which are blank or which start with a '#' character are ignored. A backslash character '\' at the end of a line joins it with the following line.
ifmt1 = <in-format>
(String)
Specifies the format of the first input table as specified by parameter in1. The known formats are listed in Section 5.1.1. This flag can be used if you know what format your table is in. If it has the special value (auto) (the default), then an attempt will be made to detect the format of the table automatically. This cannot always be done correctly, however, in which case the program will exit with an error explaining which formats were attempted. This parameter is ignored for scheme-specified tables.
[Default: (auto)]
ifmt2 = <in-format>
(String)
Specifies the format of the second input table as specified by parameter in2. The known formats are listed in Section 5.1.1. This flag can be used if you know what format your table is in. If it has the special value (auto) (the default), then an attempt will be made to detect the format of the table automatically. This cannot always be done correctly, however, in which case the program will exit with an error explaining which formats were attempted. This parameter is ignored for scheme-specified tables.
[Default: (auto)]
in1 = <table1>
(StarTable)
The location of the first input table. The forms it may take include:
   The special value "-", meaning standard input. In this case the input format must be given explicitly using the ifmt1 parameter. Note that not all formats can be streamed in this way.
   A scheme specification of the form :<scheme-name>:<scheme-args>.
   A system command line with either a "<" character at the start, or a "|" character at the end ("<syscmd" or "syscmd|"). This executes the given pipeline and reads from its standard output. This will probably only work on unix-like systems.
in2 = <table2>
(StarTable)
The location of the second input table. The forms it may take include:
   The special value "-", meaning standard input. In this case the input format must be given explicitly using the ifmt2 parameter. Note that not all formats can be streamed in this way.
   A scheme specification of the form :<scheme-name>:<scheme-args>.
   A system command line with either a "<" character at the start, or a "|" character at the end ("<syscmd" or "syscmd|"). This executes the given pipeline and reads from its standard output. This will probably only work on unix-like systems.
join = 1and2|1or2|all1|all2|1not2|2not1|1xor2
(JoinType)
Determines which rows are included in the output table. The options are:
   1and2: An output row for each row represented in both input tables (INNER JOIN)
   1or2: An output row for each row represented in either or both of the input tables (FULL OUTER JOIN)
   all1: An output row for each matched or unmatched row in table 1 (LEFT OUTER JOIN)
   all2: An output row for each matched or unmatched row in table 2 (RIGHT OUTER JOIN)
   1not2: An output row only for rows which appear in the first table but are not matched in the second table
   2not1: An output row only for rows which appear in the second table but are not matched in the first table
   1xor2: An output row only for rows represented in one of the input tables but not the other one
[Default: 1and2]
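The join options can be read as set operations on the list of matched pairs. The sketch below is a toy model under stated assumptions: the function join_rows is hypothetical, rows are identified by index, None stands for the blank side of an unmatched row, and the order of the output rows is illustrative rather than what STILTS produces.

```python
def join_rows(n1, n2, pairs, join="1and2"):
    """Return (row1, row2) output rows; None marks the blank side.

    n1, n2 are the row counts of the two tables and pairs is a list of
    matched (row1, row2) index pairs.
    """
    matched1 = {i1 for i1, _ in pairs}
    matched2 = {i2 for _, i2 in pairs}
    both = list(pairs)
    only1 = [(i, None) for i in range(n1) if i not in matched1]
    only2 = [(None, j) for j in range(n2) if j not in matched2]
    options = {
        "1and2": both,                  # INNER JOIN
        "1or2": both + only1 + only2,   # FULL OUTER JOIN
        "all1": both + only1,           # LEFT OUTER JOIN
        "all2": both + only2,           # RIGHT OUTER JOIN
        "1not2": only1,
        "2not1": only2,
        "1xor2": only1 + only2,
    }
    return options[join]


# Two three-row tables with a single match: row 0 of table 1 with row 1 of table 2.
for name in ("1and2", "1or2", "all1", "all2", "1not2", "2not1", "1xor2"):
    print(name, join_rows(3, 3, [(0, 1)], name))
```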
matcher = <matcher-name>
(MatchEngine)
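Defines the nature of the matching that will be performed. The value supplied for this parameter determines the meanings of the values required by the params, values* and tuning parameter(s).
[Default: sky]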
ocmd = <cmds>
(ProcessingStep[])
Specifies processing to be performed on the output table, after all other processing has taken place. The value of this parameter is one or more of the filter commands described in Section 6.1.

Commands may alternatively be supplied in an external file, by using the indirection character '@'. Thus a value of "@filename" causes the file filename to be read for a list of filter commands to execute. The commands in the file may be separated by newline characters and/or semicolons, and lines which are blank or which start with a '#' character are ignored. A backslash character '\' at the end of a line joins it with the following line.
ofmt = <out-format>
(String)
Specifies the format in which the output table will be written. If it has the special value "(auto)" (the default), then the output filename will be examined to try to guess what sort of file is required, usually by looking at the extension. If it's not obvious from the filename what output format is intended, an error will result.

This parameter must only be given if omode has its default value of "out".
[Default: (auto)]
omode = out|meta|stats|count|checksum|cgi|discard|topcat|samp|plastic|tosql|gui
(ProcessingMode)
Specifies what will be done with the output table. The default value is out, which means that the result will be written as a new table to disk or elsewhere, as determined by the out and ofmt parameters. However, there are other possibilities, which correspond to uses to which a table can be put other than outputting it, such as displaying metadata, calculating statistics, or populating a table in an SQL database. For some values of this parameter, additional parameters (<mode-args>) are required to determine the exact behaviour.

Possible values are
   out
   meta
   stats
   count
   checksum
   cgi
   discard
   topcat
   samp
   plastic
   tosql
   gui
Use the help=omode flag or see Section 6.4 for more information.
[Default: out]
out = <out-table>
(TableConsumer)
The location of the output table. The default value "-" writes the table to standard output.

This parameter must only be given if omode has its default value of "out".
[Default: -]
params = <match-params>
(String[])
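Determines the parameters of this match. The values required depend on the match type selected by the matcher parameter. If it contains multiple values, they must be separated by spaces; values which contain a space can be 'quoted' or "quoted".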
progress = none|log|time|profile
(String)
Determines whether information on the progress of the match is reported while it runs. The options are:
   none: no progress is shown
   log: progress information is shown
   time: progress information and some time profiling information are shown
   profile: progress information and limited time/memory profiling information are shown
[Default: log]
runner = parallel|parallel<n>|parallel-all|sequential|classic|partest
(RowRunner)
Selects the threading implementation. The options are currently:
   parallel: uses multithreaded implementation for large tables, with default parallelism, which is the smaller of 6 and the number of available processors
   parallel<n>: uses multithreaded implementation for large tables, with parallelism given by the supplied value <n>
   parallel-all: uses multithreaded implementation for large tables, with a parallelism given by the number of available processors
   sequential: uses multithreaded implementation but with only a single thread
   classic: uses legacy sequential implementation
   partest: uses multithreaded implementation even when tables are small
The parallel* options should normally run faster than sequential or classic (which are provided mainly for testing purposes), at least for large matches and where multiple processing cores are available.

The default value "parallel" is currently limited to a parallelism of 6, since larger values yield diminishing returns given that some parts of the matching algorithms run sequentially (Amdahl's Law), and using too many threads can sometimes end up doing more work or impacting on other operations on the same machine. But you can experiment with other concurrencies, e.g. "parallel16" to run on 16 cores (if available) or "parallel-all" to run on all available cores.

The value of this parameter should make no difference to the matching results. If you notice any discrepancies please report them.
[Default: parallel]
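The thread counts implied by the runner values can be summarised in a short sketch. The function parallelism is hypothetical and simply encodes the description above; partest is left out because the text does not state its thread count, and classic is treated as single-threaded on the strength of being described as sequential.

```python
import os


def parallelism(runner, available=None):
    """Thread count implied by a runner value, following the text above."""
    if available is None:
        available = os.cpu_count() or 1
    if runner == "parallel":
        # Default: the smaller of 6 and the number of available processors.
        return min(6, available)
    if runner == "parallel-all":
        return available
    suffix = runner[len("parallel"):] if runner.startswith("parallel") else ""
    if suffix.isdigit():
        # parallel<n>: parallelism given by the supplied value.
        return int(suffix)
    if runner in ("sequential", "classic"):
        return 1
    raise ValueError(f"thread count not described for runner={runner}")


for value in ("parallel", "parallel16", "parallel-all", "sequential"):
    print(value, parallelism(value, available=16))
```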
scorecol = <col-name>
(String)
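Gives the name of a column in the output table to contain the "match score" for each pairwise match. The meaning of this column depends on the chosen matcher, but it typically represents a distance of some kind between the two matching points. If a null value is chosen, no score column will be inserted in the output table. The default value of this parameter depends on matcher.
[Default: Score]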
suffix1 = <label>
(String)
If the fixcols parameter is set so that input columns are renamed for insertion into the output table, this parameter determines how the renaming is done. It gives a suffix which is appended to all renamed columns from table 1.
[Default: _1]
suffix2 = <label>
(String)
If the fixcols parameter is set so that input columns are renamed for insertion into the output table, this parameter determines how the renaming is done. It gives a suffix which is appended to all renamed columns from table 2.
[Default: _2]
tuning = <tuning-params>
(String[])
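Tuning values for the matching process, if appropriate. The values permitted depend on the match type selected by the matcher parameter. If it contains multiple values, they must be separated by spaces; values which contain a space can be 'quoted' or "quoted". If this optional parameter is not supplied, sensible defaults will be chosen.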
values1 = <expr-list>
(String[])
Defines the values from table 1 which are used to determine whether a match has occurred. Exactly what values are required is determined by the kind of match, as selected by matcher. Depending on the kind of match, the number and type of the values required will be different. Multiple values should be separated by whitespace; if whitespace occurs within a single value it must be 'quoted' or "quoted". Elements of the expression list are commonly just column names, but may be algebraic expressions calculated from zero or more columns as explained in Section 10.
values2 = <expr-list>
(String[])
Defines the values from table 2 which are used to determine whether a match has occurred. Exactly what values are required is determined by the kind of match, as selected by matcher. Depending on the kind of match, the number and type of the values required will be different. Multiple values should be separated by whitespace; if whitespace occurs within a single value it must be 'quoted' or "quoted". Elements of the expression list are commonly just column names, but may be algebraic expressions calculated from zero or more columns as explained in Section 10.
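To show how the parameters above combine on a command line, here is a minimal sketch that assembles (but does not run) a tmatch2 invocation. The helper tmatch2_command, the file names and the column names are all hypothetical, and the reading of params=2 as a sky match radius in arcseconds is an assumption about the sky matcher rather than something stated in this section.

```python
import shlex


def tmatch2_command(in1, in2, out, **params):
    """Build an argument list for a stilts tmatch2 invocation (not executed)."""
    args = ["stilts", "tmatch2", f"in1={in1}", f"in2={in2}"]
    args += [f"{name}={value}" for name, value in params.items()]
    args.append(f"out={out}")
    return args


cmd = tmatch2_command(
    "survey.fits", "catalogue.fits", "matched.fits",  # hypothetical file names
    matcher="sky",
    values1="RA DEC",            # assumed column names in table 1
    values2="RAJ2000 DEJ2000",   # assumed column names in table 2
    params="2",                  # assumed: match radius in arcsec
    join="1and2",
    find="best",
)

# shlex.join quotes the arguments that contain spaces, giving a line
# that could be pasted into a unix-like shell.
print(shlex.join(cmd))
```

Passing the list to a process launcher rather than a shell string avoids having to quote the space-separated values1 and values2 lists by hand.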