NAME

runacbea - use evolutionary algorithm to optimize cluster benchmark

SYNOPSIS

  runacbea -config name [option] ...

DESCRIPTION

Runacbea addresses the issue of optimizing the floating-point calculation rate of a compute cluster, as reported by HPL (High-Performance Linpack), the benchmark that is used to rank world's most powerful Top 500 computers.

HPL measures floating-point calculation rate by using matrix operations to solve a large, dense system of simultaneous equations. It achieves its best performance figures when the problem size approaches the maximum that can be accommodated by the random-access memory of the system under test. The actual performance achieved is heavily dependent upon sixteen parameters given to HPL, which describe how it should partition the large problem into progressively smaller problems for solution.

As HPL's execution time for a large problem is typically measured in hours, and because the parameter space is very large, it is impractical to assess the effectiveness of particular parameter combinations by repeatedly running large problems. Runacbea runs a large number of small benchmarks in order generate parameter sets that are likely to give good results when used with longer-running tests. This automatic procedure is replaces the hand-tuning that has traditionally been necessary in order to get the best possible benchmark result.

The program runs on the the head node of a compute cluster, and submits jobs to the cluster's compute nodes. The jobs repeatedly benchmark performance using relatively small problems. An evolutionary algorithm is used to optimize the results over several generations.

Information about the benchmark, its variable parameters, and their allowed values, are read from a configuration file, which also describes the job submission system and the inter-node communication method. The name of the configuration file must be provided; other parameters are optional, and affect the operation of the evolutionary algorithm, the amount of parallelism that it uses, and the disposition of output files.

As runacbea executes, it sends status information to standard output. This output is more voluminous if output is to a terminal, rather than to a plain file. There is no option to reduce the amount of output.

Upon completion, the program delivers a parameter file that can be used for a further runacbea run, usually with a larger problem size, less parallelism and a greater number of compute nodes per benchmark (see -size and -j). Parameters that have consistently produced poor results do not appear in this file, making it likely that the subsequent run will not explore unprofitable parameter combinations.

The program also delivers files defining a number of discrete parameter sets (see -n) that are likely to produce good results in individual hand-launched benchmark runs.

OPTIONS

Mandatory

-config config_file_name: The configuration file, for example mycluster.acbea. Its XML-based format is described in acbea(5). The extension .acbea is conventional, but not required.

Options Affecting the Evolutionary Algorithm

-g number_of_generations_to_run (default: 20): Run the evolutionary algorithm for the specified number of generations. Each generation consists of the number of populations set by the -n option; each population contains the number of individuals specified by the -p option.
-n number_of_populations (default: 5): Specify the number of populations in each generation. If the number of populations is greater than one, the evolutionary algorithm evolves a number of populations in parallel. Individuals may migrate between populations at each generation -- see -ir. Runacbea reports the best result for each population at the end of the run.
-p number_of_individuals_in_each_population (default: 40): Specify the number of individuals in each population.
-cr crossover_rate (between 0.0 and 1.0; default 1.0): Specify the proportion of breeding pairs of organisms that use crossover when creating an individual for the next generation. Where crossover is used, each child gene is selected at random from the corresponding gene in each of the two parents; if it is not used, the child is a clone of one of its parents, while those of the other parent's genes are ignored.
-ir immigration_rate (between 0.0 and 1.0; default 0.05): Specify the proportion of each population that is replaced by individuals from other populations in each generation.
-sr survival_rate (between 0.0 and 1.0; default 0.1): Specify the proportion of the fittest individuals in each population that survives unchanged to the next generation. A survival rate of greater than zero provides for an elitism policy in the evolutionary algorithm.
-no-scaling: By default, the evolutionary algorithm applies a sigma scaling to raw fitness values (that is, floating-point execution rate achieved) in order better to separate fitnesses that are some distance above the mean fitness of a population, and so make it more likely that the fittest individuals survive to breed. The -no-scaling option suppresses this heuristic.
-seed random_number_seed: The pseudo-random number generator used by the evolutionary algorithm is normally initialized from the host system's entropy, and hence produces a different sequence of values on each run. If -seed is given, the specified value is used, providing for repeatability between runs. It should be noted, however, that random variations in benchmark performance will result in differing outcomes for repeated runs, even if the pseudo-random number sequence is fixed.

Other Options

-j number_of_parallel_jobs (default 1): By default, the program evaluates one individual fitness (that is, set of benchmark parameters) at a time; all available processor cores are assigned to that single calculation. Thus, to evaluate the fitness of a population, its individuals' fitnesses are each evaluated in turn; to evaluate a generation, each population is evaluated in turn. By setting the number of parallel jobs to more than one, the available cores can be divided among a number of simultaneous fitness calculations. See NOTES for a discussion of the relationship between the number of parallel jobs, the number of available cores, and the benchmark parameters P and Q.
-o basename_for_output_files (default nextrun): Specify the leading part of the names of the output files. See also FILES.
-preserve percentage_of_parameters_to_preserve (default 66.7): The parameter values allowed in the subsequent run are those which account for some percentage of the upper part of the results in the current run; parameter values which appear only in the lower part are discarded, so limiting the extent of the parameter space explored by the subsequent run. The --preserve option specifies the percentage of the results used to determining the values to preserve.
-size next_run_problem_size_increase (default 4): The nextrun.acbea output file specifies parameters for a subsequent run. The -size option gives the square power of two (1, 4, 16, 64, 256) by which the next run's problems are larger than those evaluated in the current run. For example, a factor of four doubles N, the side length of the square matrix of simultaneous equations, resulting in a quadrupling of the memory requirement, and an increase of eight times in the number of CPU cycles required for solution.
-debug: If this option is specified, files relating to individual problem evaluation are preserved at the end of the run, rather than being deleted. Additionally, a ``residual check'' is applied to check the correctness of the solution of each problem run, rather than just to the problems in the first generation.

DIAGNOSTICS

Runacbea issues a warning, but continues running, if the number of jobs (see -j) is not a multiple of the scale factor (see -size). This probably means either that the cluster has insufficient resources to run the jobs in parallel, or that some cluster resources will remain unused while the jobs are running.

Runacbea complains and exits with non-zero status on detecting error conditions that prevent benchmark jobs from being submitted. These include:

Non-existent or unwriteable input files or necessary directories
Bad configuration file format or content
The return of fewer results than expected by a benchmark batch job. This may indicate that:
- The target system offers fewer resources than are necessary to satisfy the requirements of a single batch job -- for example, because the configuration file specifies more nodes than the cluster contains.
- The host match string (see ENVIRONMENT) does not match the name a sufficient number of cluster nodes.
- The configuration file allows an illegal set of parameters to be generated for a particular benchmark run.
- The MPI (Message-Passing Interface) run command and parameters are incorrectly or incompletely specified in the configuration file.
- The failure of a ``residual check'' for any HPL problem. Such a failure probably indicates a problem in the core BLAS (Basic Linear Algebra Subroutines) implementation used by HPL. (In order to save time, runacbea skips residual checks after the first generation unless the -debug option is in force.)
Examination of the error output captured by the batch job submission system (see FILES) may help in diagnosing the problem.

Note that temporary non-availability of sufficient resources to run submitted batch jobs is not an error condition; the jobs -- and hence runacbea -- will wait until sufficient resources become available. This behaviour is intentional: it allows runacbea to perform the initial stages of benchmark parameter optimization on a cluster that is being used by others; only in the final stages is it necessary to dedicate the cluster to benchmarking.

NOTES

Prerequisites

Runacbea is designed to run on a Beowulf-style compute cluster with a UNIX-derived operating environment. The following facilities are mandatory:

C/C++ build and run-time environment.
MPI build and run-time environment. In particular, a working mpicc is required.
Batch job submission and management system. Note that, at the present time, runacbea has been tested only with OAR, although it is designed to accommodate alternatives.
Remote command execution facility for batch jobs, allowing a process on one compute node to launch processes on others, satisfying checks of credentials if necessary. (Runacbea has been tested using oarsh.)
Perl, version 5.6.1 or later.

These facilities are optional:

Visual cluster monitoring system, for example Ganglia.
Visual batch job monitoring system, such as Monika.
CBLAS implementation (Basic Linear Algebra Subroutines with C language interface). By default, the benchmark programs are linked to the ATLAS implementation of CBLAS, which is built as part of the runabea build process. An alternative implementation of CBLAS may be used by providing options to the configure script for HPL. Run ./configure --help in the hpl-1.0x subdirectory of the distribution for further information.

Benchmarking a cluster

Establish the following facts concerning your cluster:
- The number of compute nodes.
- The number of cores per node.
- The amount of memory per node.
- If the cluster is composite (that is, if it contains more than one type of compute node), a hostname selector that will match only those nodes that you want to benchmark. (Runacbea cannot handle configurations having varying numbers of cores per node.)
- The type of batch job scheduler being used. If this is not among those with a corresponding run_scheduler.sh script in the distribution's util/ directory, you will have to create one. Run_oar.sh may be useful as a template.
- The command and any necessary options needed to invoke a program that uses MPI (Message-Passing Interface) for inter-process communication. Important options specify the technology used to interconnect nodes, and the remote shell command used by one compute node to run a program on another. See the sample configuration file for an example.
Once the type of the batch scheduler has been established, snode, to be found in the util directory, can be used to establish node-related information.
Build runacbea and the two benchmark programs dhpl and xhpl. (Of the latter, xhpl is the original benchmark program delivered as part of the HPL package; dhpl is a variant adapted to runacbea's need to test many unrelated sets of parameters in a single benchmark run.)

You must build on a compute node of the target cluster. This is because the build procedure for the ATLAS BLAS (Basic Linear Algebra Subroutines) library used by default by the benchmark optimizes for the system on which it is built; building on a different host results in a library that is not optimized for the target cluster. An example of the OAR batch job management system command line needed to obtain an interactive session on a compute node is
```
 oarsub -I
```
Check your documentation if you are using another job manager.

Currently, building is achieved by running temporary-build-script in the top-level directory of the distribution, hpc-ga-bench. (The intention is to replace this method with conventional GNU Autotools-based configure and make steps in a subsequent release.) Build time is dominated by the automatic optimization of the ATLAS library, and may be expected to be between 30 minutes and two hours, depending on the speed of the node, and whether the ATLAS distribution contains a preset optimization for the host system.
Determine a value for N, the problem matrix dimension, that results in a benchmark run time of approximately ten seconds on a single compute node processor core. This is accomplished by running util/ten-sec-n.
Make sure that the bin and util subdirectories of the distribution directory, hpc-ga-bench, are on you search path.

Create a subdirectory of the distribution directory in which to run tests.
Copy the sample configuration file, libacbea.x.y.z/config/sample.acbea to the directory created in step 4, and edit it to suit your configuration, incorporating the information obtained in step 1.

At this stage, you must decide how many compute nodes -- and hence, how many cores -- to use for the first stage of benchmark parameter evolution. As a rule of thumb, use a quarter of the nodes on a small cluster (one having tens of nodes), a sixteenth on larger clusters (on the order of a hundred nodes) and so on.

To maintain the ten-second benchmark run-time, multiply the value for N obtained in step 3 above by the cube root of half the number of cores on which you intend running the benchmark. (Other things being equal, problem run time is proportional to the cube of N; dividing the number of cores by two allows for things not being equal -- in particular for the time lost in inter-process communication.)
Make a trial run of runacbea. A command line similar to
```
 runacbea -config myconfig.acbea -g 2 -p 10
```
should work. This will run two generations of five populations, each of ten individuals, for a total of 100 individuals. The test should take around twenty minutes to run. Your cluster monitoring software should show no idle time on any of the nodes allocated to the test, and the proportions of system and user cycles should be similar across all nodes. (If some nodes show a high proportion of system time, you may have specified a P value that does not allow all nodes to be used for calculation.)

If the test fails, check your parameter file, and try again. See DIAGNOSTICS for hints as to what might go wrong.
Make an initial run of runacbea, using a command line such as
```
 runacbea -config myconfig.acbea -j 4
```
(You may also want to use the -o option to change the names of runacbea's output files from their defaults. On a large cluster which can support more than four parallel jobs, you should use the factor you chose in step 5 above for the -j option.)

This runs twenty generations of five populations, each of 40 individuals, for a total of 2,000 individuals. Run time will be on the order of six hours. This may double or worse if other users are utilizing some of the cluster's nodes, so preventing all jobs from running in parallel.
Make a second run of runacbea, using the parameter file generated by the first:
```
 runacbea -config nextrun.acbea -o lastrun -size 1
```
(If you used more than four jobs in step 7, specify the -j option with a quarter of the number of jobs used in step 7. In this case, you should omit the -size option.)

Run time will be approximately double that for step 6. If this is unacceptably long, just run ten generations (-g option). If no parallelism is possible, the cluster must be dedicated to this test.

Repeat this step until runacbea produces single jobs that run on all the nodes in the cluster. In this case, the processes value specified in the generated configuration file will be equal to the number of compute cores in the cluster.
By this stage, runabea has produced five optimized parameter files, lastrun_01.dat ... lastrun_05.dat to serve as a basis for individual runs of the HPL benchmark program, xhpl. In order to obtain the best possible results, you should modify the value of N so that the benchmark consumes the cluster's entire memory -- although you should be aware that complexity is proportional to the cube of N, making run times potentially very long. A rule-of-thumb formula to calculate N is

Having edited a parameter file, copy it to HPL.dat, and run xhpl. This requires that you submit a suitable command file to your batch job management system. For OAR, this file would contain a line such as
```
 mpirun -mca pls_rsh_agent oarsh -mca btl ^openib \
     -hostfile $OAR_NODEFILE -np 16 ../bin/xhpl
```
Here, the number of cores (-np option) is equal to the product of the P and Q options in the HPL.dat file, and the remaining options are the same as those that you gave for mpirunflags in step 5 above. This file should be submitted to the job manager with a command line such as
```
 oarsub -l nodes=2/core=8,walltime=24:00:00 command_file.sh
```
Again, if you have an alternative batch job manager, check your documentation.

Lastrun_01.dat is expected to produce the best result (in lastrun_01.out); however, you should also try at least some of the other parameter files.

While the benchmark is running, use your cluster monitoring software to check that node memory is as nearly fully-used as possible. (But not over-used: if paging commences, results will be greatly degraded, and you should abort the benchmark an revise the value of N downwards.)
Publish your results!

Choosing Parameters

This section discusses the relationship between the following parameters for the HPL benchmark and runacbea:

nodes (defined in param.acbea): The number of compute nodes used to run each individual benchmark. (Not the total number of nodes in the cluster.)
processes (defined in param.acbea): The number of processes used in executing each benchmark.
P, Q (defined in param.acbea): The number of rows and columns respectively in the matrix of compute cores that is used to execute each benchmark.
jobs (optionally defined on runacbea command line; defaults to 1): The number of simultaneously-submitted batch jobs used to run the benchmarks.
size (optionally defined on runacbea command line; defaults to 4): The factor by which problems in the generated next run configuration file are larger than those solved in the current run.

The rules set out below should be followed if the target cluster is to be used most efficiently, and if the benchmark results are to be optimal. Runacbea does not check adherence to the rules, and will attempt to run benchmarks even if they are not obeyed. In some cases this will result in job rejection due to insufficient resources; in others, results will be obtained, but are likely to be suboptimal because, for example, some cluster nodes are under-utilized, or because some are overloaded.

nodes * jobs == Number of nodes in cluster: To be precise, here and below, nodes == Number of nodes in cluster matching the hostprefix specified in the configuration file.
max(P) is an integer factor of processes: The evolutionary algorithm is allowed to set P to any one of a number of values; the largest of these is generally equal to processes, but any integer factor of processes is also allowable for the maximum.
P * Q == processes: This is managed automatically, provided that the condition above is satisfied
processes / nodes == Number of cores per compute node: If this relation does not hold, some cores will be under- or over-utilized.
nodes * size <= Number of nodes in cluster: If nodes * size is greater than the number of nodes in the cluster, the generated nextrun.acbea file will specify problems that demand more resources than the cluster can offer.

A small example

Consider a small cluster having four compute nodes, each with eight processor cores.

(Q is always set to a value such that P*Q == processes.)

These parameters would run each the individuals of each population across sixteen cores in two nodes in sequence; only one benchmark would run at a time, and two cluster nodes would be idle (or available for other purposes).

nodes=2, processes=16, max(P)=16, jobs=2: These parameters would run two parallel jobs, which would share the tasks of running the individuals, each of which would execute using sixteen cores on two nodes. This means that two benchmarks run simultaneously, fully utilizing all nodes in the cluster.
nodes=2, processes=16, max(P)=16, jobs=3: These parameters would attempt to run three parallel jobs, each again executing sixteen-core benchmarks. However, there being insufficient resources to allow this, the third job would be queued until the first or second had completed. Because of the greater job-management overhead, and because two nodes would be unused while the third job completed, wall-clock time would be greater than for the second case above.

In all of the above problems, size has been left at its default of 4, with the result that the problems specified by the resulting nextrun.acbea files would demand 64 cores -- more than exist in the cluster. They would, however, be suitable for an eight-node cluster.

FILES

param.acbea (input): XML-format definition of benchmark environment and allowed values for parameters -- see acbea(5). (This file can actually have any name and extension of the user's choice.)
nextrun.acbea (output): Runacbea configuration file defining optimized parameters for a subsequent run, usually with a larger problem size.
nextrun_nn.dat (outputs): Individual xhpl parameter files corresponding to the best parameters discovered in each population, and suitable for running benchmarks by hand, possibly after editing to tweak parameters such as N, the problem size, and changing the filename to the HPL.dat expected by xhpl. Nextrun_01.dat corresponds to the ``best-of-best'' set of parameters, nextrun_02.dat to the second best, and so on.
acbea-yyddmm-HHMMSS (output directory): If the -debug option is specified, or if runacbea encounters an error, this directory contains the following files for each job run by runacbea:
acbearc: If a file acbearc exists in runacbea's working directory, it will be sourced by every job launch script. This allows the script environment to be customized -- for example, to add directories to the default LD_LIBRARY_PATH.

ENVIRONMENT

Runacbea places the following variables in the environment of each program that it launches to manage benchmark batch jobs:

ACBEA_PROCS: The number of processes available to the current batch job. Should be equal to the product of the number of processor cores per node and ACBEA_NODES (see below).
ACBEA_HOSTPREFIX: A Perl regular expression matching the right part of the hostnames of the compute nodes that should be used for the current batch job; the left parts are assigned by the job manager. Set from configuration file. The intention is to allow some subset of the available nodes to be used for a benchmark, for example where a cluster is made up of two types of node which are distinguishable by their hostnames.
ACBEA_NAME: The name of the current batch job, used for reporting by the batch scheduler. Has the form seq_nnn.
ACBEA_NODES: The number of compute nodes (not compute cores) available to the current batch job. See also NOTES.
ACBEA_PROJECT: The name of the project, used for reporting by the batch scheduler. Defined at libacbea build-time; defaults to hpl-ga-bench.
ACBEA_WALLTIME: The expected run-time of the job, plus a safety margin, in minutes. Currently not set; scheduler wrapper scripts should use a conservative default.
ACBEA_WORKDIR: The directory used for job management, input and output files. Set from configuration file.

BUGS

The following problems are known to exist in the current release:

Error handling is not always as graceful as it could be, in particular when the batch job submission script is misbehaving.
The benchmarking procedure (see NOTES) would benefit from further simplification and automation.
Runacbea should be checkpointable, in order that interrupted runs can be resumed.
If the configuration used for a first run specifies jobs that use a number of cores (processes) that is not a whole multiple of the number of cores per compute node, the <-size> option produces incorrect processes and nodes numbers in the next run configuration file.
The work directory may contain several thousand files at the end of a complex run; perhaps its organization should not be flat.

If you discover a bug, please log it using the event tracking system at https://gforge.uni.lu/projects/hpc-ga-bench.

AUTHOR

Dominic Dunlop, mailto:/dominic.dunlop@uni.lu

COPYRIGHT AND LICENCE

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; for details see http://www.gnu.org/copyleft/fdl.html.