
   ==================================================================
   ===                                                            ===
   ===           GENESIS Distributed Memory Benchmarks            ===
   ===                                                            ===
   ===                           QCD1                             ===
   ===                                                            ===
   ===    Monte-Carlo Simulation of the (3+1)-Dimensional Pure    ===
   ===                 SU(3) Lattice Gauge Theory                 ===
   ===                                                            ===
   ===              Author:    Eckardt Kehl                       ===
   ===              PALLAS GmbH                                   ===
   ===              Hermulheimer Str. 10                          ===
   ===              5040 Bruhl, GERMANY                           ===
   ===     tel.:+49-2232-18960   e-mail:karls@pallas-gmbh.de      ===
   ===                                                            ===
   ===     Copyright: PALLAS GmbH                                 ===
   ===                                                            ===
   ===          Last update: April 1992; Release: 2.0             ===
   ===                                                            ===
   ==================================================================


1. Description
--------------
This benchmark is based on a 'pure gluon' SU(3) lattice gauge theory 
simulation, using the Monte-Carlo heatbath technique. It differs from
the QCD2 benchmark in that it uses the 'quenched' approximation,
which neglects dynamical fermions.

The simulation is defined on a four-dimensional lattice which is a
discrete approximation to continuum space-time. The basic variables
are 3 by 3 complex matrices. Four such matrices are associated with
every lattice site. The lattice update is performed using a multi-hit
Metropolis algorithm.
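The multi-hit Metropolis step can be illustrated with a toy one-variable
example. The sketch below uses a simple Gaussian action in place of the
SU(3) link action, and the function name, step size, and number of hits
are illustrative assumptions, not taken from the benchmark code:

```python
import math
import random

def multi_hit_metropolis(x, action, n_hits=8, step=0.5, rng=random):
    """Apply n_hits Metropolis proposals in a row to a single variable x.

    'action' returns the (dimensionless) action contribution of x; each
    proposal is accepted with probability min(1, exp(-dS)). Returns the
    updated value and the number of accepted hits.
    """
    accepted = 0
    for _ in range(n_hits):
        x_new = x + rng.uniform(-step, step)            # local proposal
        d_s = action(x_new) - action(x)                 # change in action
        if d_s <= 0 or rng.random() < math.exp(-d_s):   # accept/reject
            x, accepted = x_new, accepted + 1
    return x, accepted

# Toy "site update" with Gaussian action S(x) = x**2 / 2
rng = random.Random(12345)
x, acc = multi_hit_metropolis(3.0, lambda v: 0.5 * v * v, n_hits=10, rng=rng)
print(x, acc)
```

In the benchmark itself the variable is a 3 by 3 complex link matrix and
the action change is computed from the neighbouring 'staples', but the
accept/reject logic is the same.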

In the parallel version of the program, the lattice can be distributed
in any one or more of the four lattice directions. 


2. Operating Instructions
-------------------------

File I/O:

The main program routine reads an input file, "qcd1.dat".
Additional information about the meaning of the different input 
parameters is given within the input file. The only parameters that should 
normally be changed to run the standard benchmark are the spatial and
temporal lattice size parameters and the number of processes in the 
distributed version. A permanent record of the benchmark run is saved
in a file called "result". This contains information on the lattice
size and its distribution over processes, plus timing information
and some information on the physical solution for each iteration.
Error messages and some temporary information are output to the
standard output on channel 6.


Changing problem size and numbers of processes:

The lattice size is based on a 4-dimensional space-time lattice of
size:  N = NS**3 * NT, where NT & NS are even integers.

In the sequential version of the program the lattice size is set by changing 
the PARAMETER statements in the include file qcd1.inc

The parallel version of the program is distributed over the spatial
dimensions in a 4-D process grid of size:  

      NP = NPX * NPY * NPZ * NPT

The local lattice size on each processor is then: 

      n = (NS/NPX) * (NS/NPY) * (NS/NPZ) * (NT/NPT)

The parameters NS, NT, NPX, NPY, NPZ & NPT are set in the input data file 
qcd1.dat. 
The values should be chosen such that the dimensions of the local lattice 
are even integers. The maximum numbers of processes in each dimension
are specified by PARAMETER statements in the include file `qcd1h.inc';
if any of these values (normally 4) is exceeded, the program prints an
error message and terminates. Similarly, the maximum local lattice
dimensions are specified by PARAMETER statements in the include file
`qcd1n.inc'; again, an error is reported if any of these maximum
dimensions is exceeded. These maximum values can be changed by altering
the PARAMETER statements, but care must be taken not to exceed the
available node memory as a consequence.
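The sizing rules above can be sketched in a few lines; the function name
and the default maximum of 4 processes per dimension are illustrative
(the real limits live in the PARAMETER statements of qcd1h.inc and
qcd1n.inc):

```python
def check_decomposition(ns, nt, npx, npy, npz, npt, max_procs_per_dim=4):
    """Check a lattice decomposition against the rules in the text.

    Global lattice:  N = NS**3 * NT
    Process grid:    NP = NPX * NPY * NPZ * NPT
    Local lattice:   n = (NS/NPX) * (NS/NPY) * (NS/NPZ) * (NT/NPT)

    All local dimensions must be even integers, and no process-grid
    dimension may exceed the compiled-in maximum (normally 4).
    """
    for p in (npx, npy, npz, npt):
        if p > max_procs_per_dim:
            raise ValueError("process grid dimension exceeds maximum")
    local_dims = (ns // npx, ns // npy, ns // npz, nt // npt)
    for full, p, d in zip((ns, ns, ns, nt), (npx, npy, npz, npt), local_dims):
        if full % p != 0 or d % 2 != 0:
            raise ValueError("local lattice dimensions must be even integers")
    n_global = ns ** 3 * nt
    n_local = local_dims[0] * local_dims[1] * local_dims[2] * local_dims[3]
    return n_global, n_local

# e.g. an NS=4, NT=8 lattice on a 2*2*2*1 process grid (8 processes):
print(check_decomposition(4, 8, 2, 2, 2, 1))  # prints (512, 64)
```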


Suggested Problem Sizes:

The smallest possible local lattice size is 2 * 2**3 = 2**4 lattice
points per processor, though in this case the granularity is too small
to give efficient distribution or vectorization.

A more reasonable minimum lattice size per processor is 2**6 lattice points,
i.e. 8 * 4**3 on an 8 processor grid or 4 * 8**3 on 32 processors.

The likely maximum local lattice size that will fit into node memory is
around 2**11 local lattice points, i.e. 32 * 8**3 on an 8 processor grid.

It is suggested that the benchmark be run for a range of problem sizes,
from the suggested minimum to the maximum possible lattice size, over
the range of available processor counts.
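The per-processor counts quoted above follow from dividing the global
lattice volume NT * NS**3 by the number of processes; a quick sketch to
verify the quoted figures:

```python
# Per-processor lattice points for the suggested sizes above.
# Each entry is (NT, NS, number of processes).
suggested = [
    (2, 2, 1),    # smallest: 2 * 2**3 = 2**4 points on one processor
    (8, 4, 8),    # reasonable minimum: 2**6 points per processor
    (4, 8, 32),   # reasonable minimum: 2**6 points per processor
    (32, 8, 8),   # likely maximum: 2**11 points per processor
]
for nt, ns, nprocs in suggested:
    points = nt * ns ** 3              # global lattice volume
    print(nt, ns, nprocs, points // nprocs)
```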


Compiling and Running The Benchmark:

1) Choose problem size and number of processes. In the sequential
   version this is done by editing PARAMETER statements in the file
   qcd1.inc. In the distributed version the problem size and number
   of processes in each dimension are set in the input data file
   qcd1.dat. Upper limits for the numbers of processes are set in the
   include file qcd1h.inc. Similarly the upper limits for the local
   lattice size are set in the file qcd1n.inc. These upper limits may
   be changed but care should be taken not to exceed the available
   node memory.

2) To expand the PARMACS macros and compile and link the code
   with the appropriate libraries, type:    make

3) If any of the parameters in the include files are changed,
   the code has to be recompiled. The make-file will automatically
   send only the affected files to the compiler. Type:    make

4) On some systems it may be necessary to allocate the appropriate
   resources before running the benchmark, e.g. on the iPSC/860,
   to reserve a cube of 8 processors, type:    getcube -t8

5a) To run the sequential benchmark, type:    qcd1

5b) To run the distributed benchmark, type:   host

   This will automatically load both host and node programs. 

   The progress of the benchmark execution can be monitored via
   the standard output, whilst a permanent copy of the benchmark
   output is written to a file called 'result'.

6) If the run is successful and a permanent record is required, the
   file 'result' should be copied to another file before the next run
   overwrites it.




Vectorization:
--------------
The program has been written completely in a vectorizable form.
The vector length equals half the lattice volume. 
The most important subroutines for vectorization are: PRO, STAPLE,
MERTRO, ADD, GATHER, SCATTER and ACCEPT.
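The factor of one half in the vector length comes from an even/odd
(checkerboard) ordering of the lattice sites, standard for this kind of
update; the sketch below is an illustrative assumption rather than code
taken from the benchmark:

```python
from itertools import product

def checkerboard(dims):
    """Split lattice sites into even and odd sublattices by coordinate parity.

    Sites of equal parity share no nearest-neighbour couplings among
    themselves, so each half-lattice can be updated as one long vector.
    """
    even, odd = [], []
    for site in product(*(range(d) for d in dims)):
        (even if sum(site) % 2 == 0 else odd).append(site)
    return even, odd

even, odd = checkerboard((4, 4, 4, 8))  # an NS=4, NT=8 lattice
print(len(even), len(odd))  # each half is 4**3 * 8 / 2 = 256 sites
```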
