runacbea -config config.acbea ...
Aspects of runacbea's operation related to the benchmark parameter space and the tools used to explore it are specified by an XML-format configuration file, conventionally named config.acbea. However, the file may have any name, which may include any (or no) filename extension.
The file contains a single acbea_config
element with the following content:
acbea
An optional empty element specifying the version of the DTD to which the file conforms.
description
A required empty element having attributes that describe the benchmark run specified by the file.
prime
A required empty element having attributes that describe the benchmark execution environment. (The somewhat irrelevant name of this element is inherited from the ACOVEA package upon which ACBEA is based.)
parameters
A required element specifying the names and allowed values for the benchmark parameters that may be varied by ACOVEA's evolutionary algorithm.
The required elements are described in detail in the subsections that follow. When
specifying values for attributes, be aware that characters that are special to XML must be
represented by entities such as &lt;
, &gt;
, and &amp;
for <, >, and
& respectively in shell commands etc.
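The escaping can be done mechanically. The following is a minimal sketch in Python using the standard xml.sax.saxutils module; the helper name xml_attr and the sample awk command are hypothetical and serve only as an illustration.

```python
from xml.sax.saxutils import escape


def xml_attr(name, value):
    """Return name="value" with XML special characters replaced by entities."""
    # escape() handles &, < and >; the extra mapping also covers the double
    # quote, because the value is wrapped in double quotes here.
    return '{}="{}"'.format(name, escape(value, {'"': "&quot;"}))


# Hypothetical shell command containing all three special characters.
command = "awk '$1 > 0 && $2 < 5 {print $NF}'"
print(xml_attr("resultfilter", command))
```

The printed text can be pasted into the prime element as it stands; an XML parser turns the entities back into the original characters before runacbea sees the value.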
The description
element has the following attributes:
value
(required)
A short description of the ACBEA run parameterized by this file, for example ``Evaluation of Acme Institute PetaGiant cluster, 2010-06-05''. This description appears in reports.
version
(default 1.0.0)
The version of the benchmark addressed by the file. This information is currently unused, other than in reports.
header
(required)
A short description of the function of the file, for example ``HPLinpack discrete problems benchmark input for ACBEA''. This information appears in reports. Configuration files generated by runacbea for subsequent runs of the program will have ``(auto-generated)'' appended to this value.
The prime
element has the following attributes:
batchcommand
(default sh
)
The command to submit a shell script for execution on one or more nodes. Where there is no
batch scheduler, this may simply be sh
, otherwise it should be a script that schedules
a job, then waits for it to complete. (The name of the scheduler command itself is
unlikely to be what's needed here.) The util/run-oar.sh script in the ACBEA
distribution performs this function for the OAR batch job scheduler, and may be used as a
template for the development of scripts suitable for other schedulers. Batchcommand may
be on the search path; otherwise a pathname relative to runacbea's current directory,
or a full pathname, may be used.
hostselect
(default ^node-
)
A Perl regular expression matching the addresses of cluster nodes that may be used for execution. The default is likely to work on homogeneous clusters, but more complex expressions may be used to select nodes of a particular class in a heterogeneous cluster, or to avoid particular dead or degraded nodes. To give a complex example,
^chinqchint-([1-689][^0-9]|1[0-35-9]|2[01267]|3[0-35-8])
selects up to 30 hosts having fully-qualified domain names starting with the following strings: chinqchint-1. to chinqchint-6., chinqchint-8., chinqchint-9., chinqchint-10 to chinqchint-13, chinqchint-15 to chinqchint-19, chinqchint-20 to chinqchint-22, chinqchint-26, chinqchint-27, chinqchint-30 to chinqchint-33, and chinqchint-35 to chinqchint-38.
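A hostselect expression can be checked before it is used. The following sketch relies on Python's re module, which accepts the same syntax as Perl for this expression; the node list and the example.org domain are hypothetical.

```python
import re

HOSTSELECT = r"^chinqchint-([1-689][^0-9]|1[0-35-9]|2[01267]|3[0-35-8])"


def select_hosts(pattern, hosts):
    """Return the hosts whose names match the regular expression."""
    rx = re.compile(pattern)
    return [h for h in hosts if rx.search(h)]


# Hypothetical cluster of 40 nodes numbered 1 to 40.
hosts = ["chinqchint-{}.example.org".format(n) for n in range(1, 41)]
selected = select_hosts(HOSTSELECT, hosts)

# Recover the node numbers to make the selection easier to read.
numbers = [int(h.split("-")[1].split(".")[0]) for h in selected]
print(len(selected), numbers)
```

Listing the matched node numbers in this way makes it easy to confirm that dead or degraded nodes have really been excluded.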
mpisetup
(default empty)
This is the command line (if any) needed to set up the MPI environment prior to using mpirun; for example,
mpdboot -n $(uniq $OAR_NODEFILE | wc -l) --file=$OAR_NODEFILE --rsh=oarsh
is needed to start daemons if MPICH2 is being used in a cluster managed by OAR.
mpirun
(default mpirun
)
The command to run an MPI-aware application.
mpirunflags
(default empty)
This parameter specifies the options that are needed to make mpirun work, and work well. These may call out a communications fabric and the parameters that it requires. For example,
-hostfile $OAR_NODEFILE -mca pls_rsh_agent oarsh -mca btl ^openib
gives good results with OpenMPI v1.2 on a small gigabit-Ethernet-connected cluster managed by OAR.
mpishutdown
(default empty)
This parameter gives the command line (if any) needed to shut down the MPI environment after
using mpirun; for example, mpdallexit
terminates MPICH2 daemons.
processes
(default 1)
This parameter gives the number of MPI processes to use; this should be equal to the number of compute cores available to the benchmark, that is, the number of cores per node multiplied by the number of nodes. (Note that ACBEA cannot currently run benchmarks that use partial nodes.) Configuration files generated by runacbea for subsequent runs of the program multiply this value by the scaling factor (see runacbea).
nodes
(default 1)
This parameter specifies the number of cluster nodes across which to distribute processes. Configuration files generated by runacbea for subsequent runs of the program multiply this value by the scaling factor (see runacbea).
benchcommand
(default dhpl
)
Benchcommand is the command that runs the benchmark. It should be dhpl, or some other command that understands dhpl control files. Because processes started on compute nodes by a job manager typically have a default search path, rather than the search path seen on the head node by runacbea, this must be a pathname relative to runacbea's working directory, or a full pathname.
benchflags
(default -f
)
If benchcommand requires command-line flags, they are given here. The name of the parameter file generated by runacbea is appended to these flags when benchcommand is run.
resultfilter
(default awk '/^D[RC].*[0-9]$/{print $NF}'
)
This attribute gives a shell command that isolates GFlops performance figures from the output of the benchmark command. The default value works for dhpl.
resultverify
(default awk '/ (failed|skipped) /{c+=$1}END{exit c}'
)
This shell command returns non-zero status on finding error indicators in the output of the benchmark command. The default value works for dhpl. On the assumption that calculation problems will make themselves apparent early in a runacbea run, verification takes place only in the first generation of evolution unless debugging information has been requested. Output from this command (if any) appears in runacbea's debugging output.
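To show what the two default awk commands compute, the following sketch re-expresses them in Python. The function names are hypothetical, and the sample output is illustrative only: it is modeled on the layout of HPL's report and is not captured from a real dhpl run.

```python
import re

RESULT_LINE = re.compile(r"^D[RC].*[0-9]$")      # same pattern as resultfilter
ERROR_LINE = re.compile(r" (failed|skipped) ")   # same pattern as resultverify


def result_filter(output):
    """Return the last field of each result line, as print $NF does."""
    return [line.split()[-1] for line in output.splitlines()
            if RESULT_LINE.search(line)]


def result_verify(output):
    """Sum the leading counts on 'failed' and 'skipped' lines.

    Zero means success. Assumes the first field of such lines is an integer.
    """
    return sum(int(line.split()[0]) for line in output.splitlines()
               if ERROR_LINE.search(line))


# Illustrative output only (hypothetical values).
sample = """\
DR00L2L2        1000    64     2     2               0.52          1.284e+00
      1 tests completed and passed residual checks,
      0 tests completed and failed residual checks,
      0 tests skipped because of illegal input values.
"""
print(result_filter(sample), result_verify(sample))
```

A replacement resultfilter or resultverify for a different benchmark command needs to provide the same two things: one performance figure per benchmark, and a status that is zero only when no errors were reported.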
The parameters
element contains one or more parameter
elements. In practice
exactly the parameters named below must be used, and in the order given.
A parameter element requires a name
attribute to give the parameter a name, and may have a description
attribute briefly describing its function. A parameter must be of one of four kinds, specified by its type
attribute:
global
The parameter applies unchanged to all benchmarks run by a single invocation of
dhpl. Its value, which may be integer, floating point or string, is specified by the
value
attribute.
dummy
The parameter value for a particular benchmark will be supplied by runacbea, based on the values of other parameters, rather than by the configuration file.
enum
The integer parameter value for a particular benchmark may assume any one of a number of
values specified as a vertical bar-separated list by the value
attribute, for example
value="1|2|4|8"
. As a degenerate case, a parameter with a fixed value for all
benchmarks may be specified with a list containing a single value, as in value="1"
.
tuning
The integer parameter value for a particular benchmark may assume any one of a number of
values between inclusive lower and upper bounds specified by the min
and max
attributes respectively, and separated by the difference specified by the step
attribute. For example, min="2" max="8" step="2"
would allow a parameter to assume the
values 2, 4, 6 and 8.
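The value lists described by enum and tuning parameters can be expanded as in the following sketch. The function name allowed_values and the use of a dictionary of attributes are assumptions made for the example; they do not describe runacbea's internals.

```python
def allowed_values(attrs):
    """Expand the attributes of an enum or tuning parameter into a list."""
    kind = attrs["type"]
    if kind == "enum":
        # Vertical bar-separated list of integers.
        return [int(v) for v in attrs["value"].split("|")]
    if kind == "tuning":
        # Inclusive bounds, so the upper limit of range() is max + 1.
        lo, hi, step = (int(attrs[k]) for k in ("min", "max", "step"))
        return list(range(lo, hi + 1, step))
    raise ValueError("no value list for parameter type " + repr(kind))


print(allowed_values({"type": "enum", "value": "1|2|4|8"}))
print(allowed_values({"type": "tuning", "min": "2", "max": "8", "step": "2"}))
```

Multiplying together the lengths of the lists for all enum and tuning parameters gives a feel for the size of the parameter space that is to be explored.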
The parameters required by dhpl are as follows:
OUTF
(global)
The name of the benchmark result output file. The value is ignored: runacbea generates file names for each invocation of benchcommand.
OUTD
(global)
This parameter tells dhpl where to direct its output: 6=stdout, 7=stderr, other value=file. Runacbea requires that its value be set to 1, indicating output to a file.
THRSH
(global)
Dhpl runs sanity checks on benchmark results if this floating-point parameter, which specifies an error margin, has a value that is greater than zero. 16.0 is a reasonable value. Note that, by default, runacbea suppresses sanity checks in the second and subsequent generations in order to reduce run-time.
NP
(global)
This parameter specifies the number of problems specified by a dhpl control file. The value is ignored: runacbea generates a value for each invocation of benchcommand.
N
(enum with single possible value)
This parameter specifies the HPL problem size; that is, the number of rows and columns in the square matrix of simultaneous equations solved by the benchmark. A suitable starting value may be determined by running ten-sec-n.pl. (See also runacbea/Benchmarking a cluster.)
NB
(tuning)
This parameter specifies the dimension of the small sub-matrices into which the problem is ultimately decomposed. It has been found to be the most critical parameter in obtaining optimum benchmark results. The range specified in the sample configuration file,
min="32" max="128" step="8"
is a good starting point, although it may be profitable to
increase max
to 256 in some cases. Step
should be kept relatively small to avoid
the danger of missing a sharp peak in performance. Odd values seem never to give good
results, and should be avoided.
PMAP
(enum)
This parameter specifies whether the problem matrix is mapped into memory in row- or column-major form (values 0 and 1 respectively). The sample configuration's value="0|1"
should be used.
P
(enum)
P
gives the number of rows in the matrix of compute cores across which the problem is
partitioned for solution. Allowed values should be factors of the number of compute cores
specified by the processes
attribute of prime
(see above). For example, for 36
cores, value="1|2|3|4|6|9|12|18|36"
gives an exhaustive list of allowable values for
P
. In practice, matrices that are considerably ``over-square'' are likely to give poor
results, so value="1|2|4|6|9"
may be a better list in this case. (Note that specifying
an allowed value that is not a factor of the number of cores may not result in an error,
just poor results for those benchmarks that use it.)
Q
(dummy)
Q
gives the number of columns in the matrix of compute cores across which the problem
is partitioned. Runacbea calculates its value by dividing the number of cores by the
current value of P
.
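The relationship between P, Q and the number of cores can be sketched as follows. The function name process_grids is hypothetical, and the P <= Q cut-off used at the end is an assumption of the example rather than a rule applied by runacbea.

```python
def process_grids(processes):
    """Return (P, Q) pairs with P * Q == processes, in increasing order of P."""
    return [(p, processes // p) for p in range(1, processes + 1)
            if processes % p == 0]


grids = process_grids(36)

# Exhaustive value list for P, in the form expected by the value attribute.
print("|".join(str(p) for p, _ in grids))

# Keep only grids that are no taller than they are wide (P <= Q); this
# cut-off is an assumption of the example.
print([g for g in grids if g[0] <= g[1]])
```

Because Q is derived by division, any value of P that is not a factor of the number of cores leaves some cores unused, which is one reason such values give poor results.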
PFACT
(enum)
PFACT
specifies the panel factorization algorithm used. There are three possibilities:
left-looking (0), Crout's method (1), and right-looking (2). The sample configuration's
value="0|1|2"
should be used.
NBMIN
(enum)
The panel factorization algorithm is recursive, dividing the panel into ever-smaller
sub-panels. NBMIN
specifies the stopping criterion: recursion terminates when the
number of columns in any sub-panel is less than or equal to this number, which must be
greater than one. As values greater than four are seldom worth investigating, the sample
configuration's value="2|3|4"
is probably a good choice.
NDIV
(enum)
The factorization algorithm divides the panel into NDIV
sub-panels at each step. Small
positive powers of two, as specified by the sample configuration's value="2|4|8"
,
should be investigated.
RFACT
(enum)
Analogously to PFACT
, RFACT
specifies left-looking (0), Crout's method (1), or
right-looking (2) recursive factorization. Again, the sample configuration's
value="0|1|2"
should be used.
BCAST
(enum)
This parameter specifies the inter-node broadcast algorithm, which may be:
0 = Increasing ring
1 = Increasing ring (modified)
2 = Increasing two-ring
3 = Increasing two-ring (modified)
4 = Long (bandwidth reducing)
5 = Long (bandwidth reducing modified)
value="0|1|2|3|4|5"
, as specified in the sample configuration, will explore all
possibilities, although, in practice on small- to medium-sized fully-connected clusters
with high-bandwidth interconnect, increasing ring or increasing ring (modified) are likely
to give the best results.
DEPTH
(enum)
The DEPTH
parameter controls the depth to which the factorization algorithm looks ahead
through multiple panels. When DEPTH
is zero, there is no look-ahead; when one, there is
one panel's worth of look-ahead, and so on. HPL's tuning notes state that 0 or 1 are
likely to give the best results, and that look-ahead of depths 3 and larger will
probably not give you better results, making the sample configuration's
value="0|1|2|3"
perhaps a little conservative.
FSWAP
(enum)
The FSWAP
parameter controls the algorithm used to update the trailing
sub-matrix. There are two primary choices: binary exchange (0); and long (1), also known
as spread-roll, which is a bandwidth-reducing variant of binary exchange. A third possibility,
mixed (2), switches from binary exchange to long if the number of columns being processed
exceeds the TSWAP
value -- see below. The sample configuration specifies
value="0|1|2"
, exploring all three possibilities.
TSWAP
(tuning)
The TSWAP
parameter is used only when FSWAP
is 2, and specifies the column count
threshold at which the switch between algorithms takes place. The sample configuration's
min="0" max="128" step="32"
has been found to give reasonable results. Note that a
value of zero results in the long algorithm being used exclusively, while a large value
that is never reached results in exclusive use of binary exchange. Thus an alternative
starting point would be always to make the value of FSWAP
2 (mixed), and allow extreme
values of TSWAP
to explore the no-switch cases.
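The interaction between FSWAP and TSWAP described above can be summarized in a few lines. This is a sketch of the selection rule as this page describes it, not code taken from HPL or dhpl; the function and constant names are hypothetical.

```python
BINARY_EXCHANGE, LONG, MIXED = 0, 1, 2


def swap_algorithm(fswap, tswap, columns):
    """Return the swapping algorithm used for a given number of columns."""
    if fswap == MIXED:
        # Switch to long once the column count exceeds the threshold.
        return LONG if columns > tswap else BINARY_EXCHANGE
    # For FSWAP 0 or 1 the threshold is ignored.
    return fswap


print(swap_algorithm(MIXED, 0, 64))        # threshold of zero
print(swap_algorithm(MIXED, 10**9, 64))    # threshold never reached
```

With a threshold of zero every positive column count exceeds it, so the long algorithm is always chosen; with a very large threshold the switch never happens.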
L1
(enum)
The L1
parameter determines how the upper triangle of the matrix being solved is stored
in memory: 0 selects a transposed form; 1, non-transposed. The example parameters allow
either value.
U
(enum)
Like L1
, U
selects a transposed (0) or non-transposed (1) storage layout, this time
for panel rows. Again, the example parameters allow either value.
E
(enum)
If E
is 1, an equilibration phase is added to the long swapping algorithm -- see
TSWAP
and FSWAP above; if it is zero, equilibration is skipped. Equilibration takes
time, but is likely to result in better distribution of work across compute nodes. The
example parameters allow either value.
A
(enum)
Memory is allocated by HPL at addresses that are a multiple of A
. The fixed value of 8,
the size in bytes of a double-precision floating point number, specified by the sample
configuration, is probably adequate, although value="8|16"
might also be investigated.
The Document Type Definition (DTD) for config.acbea is as follows:
<!DOCTYPE acbea_config [
<!ELEMENT acbea_config (acbea?, description, prime, parameters) >
<!ELEMENT acbea EMPTY>
<!ELEMENT description EMPTY>
<!ELEMENT prime EMPTY>
<!ELEMENT parameters (parameter, parameter*) >
<!ELEMENT parameter EMPTY>
<!ATTLIST acbea
    version CDATA "1.0" >
<!ATTLIST description
    value CDATA #REQUIRED
    version CDATA "1.0.0"
    header CDATA #REQUIRED >
<!ATTLIST prime
    batchcommand CDATA "sh"
    hostselect CDATA "node-"
    mpisetup CDATA ""
    mpirun CDATA "mpirun"
    mpishutdown CDATA ""
    mpirunflags CDATA ""
    processes CDATA "1"
    nodes CDATA "1"
    benchcommand CDATA "dhpl"
    benchflags CDATA "-f"
    resultfilter CDATA "awk '/^D[RC].*[0-9]$/{print $NF}'"
    resultverify CDATA "awk '/ (failed|skipped) /{c+=$1}END{print c}'" >
<!ATTLIST parameter
    type (global|dummy|enum|tuning) #REQUIRED
    name CDATA #REQUIRED
    value CDATA #IMPLIED
    description CDATA #IMPLIED
    min CDATA #IMPLIED
    max CDATA #IMPLIED
    step CDATA #IMPLIED >
]>
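Python's standard xml.etree.ElementTree parser does not validate against a DTD, but the structural rules above are simple enough to check by hand. The following sketch does so; the function name check_config and the cut-down sample file are hypothetical, and a usable configuration would need the full set of dhpl parameters.

```python
import xml.etree.ElementTree as ET

REQUIRED_ATTRS = {
    "description": ("value", "header"),
    "parameter": ("type", "name"),
}
PARAMETER_TYPES = ("global", "dummy", "enum", "tuning")


def check_config(text):
    """Return a list of problems found in a configuration; empty if none."""
    root = ET.fromstring(text)
    if root.tag != "acbea_config":
        return ["root element is %s, not acbea_config" % root.tag]
    problems = []
    children = [child.tag for child in root]
    if children and children[0] == "acbea":
        children = children[1:]          # the acbea element is optional
    if children != ["description", "prime", "parameters"]:
        problems.append("unexpected child elements: " + ", ".join(children))
    for elem in root.iter():
        for attr in REQUIRED_ATTRS.get(elem.tag, ()):
            if attr not in elem.attrib:
                problems.append("%s lacks attribute %s" % (elem.tag, attr))
        kind = elem.get("type")
        if (elem.tag == "parameter" and kind is not None
                and kind not in PARAMETER_TYPES):
            problems.append("parameter has invalid type %r" % kind)
    params = root.find("parameters")
    if params is not None and params.find("parameter") is None:
        problems.append("parameters contains no parameter element")
    return problems


# Cut-down, hypothetical configuration used only to exercise the checks.
sample = """<acbea_config>
  <acbea version="1.0" />
  <description value="Example run" header="Example input for ACBEA" />
  <prime processes="36" nodes="9" />
  <parameters>
    <parameter type="enum" name="P" value="1|2|4|6|9" />
    <parameter type="dummy" name="Q" />
  </parameters>
</acbea_config>"""
print(check_config(sample))
```

A validating XML parser given the DTD would perform the same checks and would also supply the default attribute values, which this sketch does not attempt.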
It should not be necessary to list the parameters in the correct order (which reflects the required order of the parameters in xhpl's control file): runacbea should be capable of sorting things out.
The description of the hostselect
parameter assumes that the batch job management
system has some means of selecting compute nodes based on a Perl regular expression. While
this is true of OAR because the underlying MySQL database manager can handle such regular
expressions, it is probably not true of all job managers. If you use ACBEA with another
job manager, you may need to change the interpretation of hostselect
.
runacbea(1), ten-sec-n.pl(1), perlre(1), http://www.netlib.org/benchmark/hpl/algorithm.html, http://oar.imag.fr.
Dominic Dunlop, mailto:dominic.dunlop@uni.lu
Copyright (C) 2009 by Dominic Dunlop
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; for details see http://www.gnu.org/copyleft/fdl.html.