
   ==================================================================
   ===                                                            ===
   ===           GENESIS Distributed Memory Benchmarks            ===
   ===                                                            ===
   ===                           QCD2                             ===
   ===                                                            ===
   ===     Conjugate Gradient iteration in SU(3) lattice gauge    ===
   ===              theory with Kogut-Susskind fermions           ===
   ===                                                            ===
   ===     Original author:        John Merlin                    ===
   ===     Modified by    :        Ivan Wolton                    ===
   ===     PARMACS macros :        Vladimir Getov                 ===
   ===     Department of Electronics and Computer Science         ===
   ===               University of Southampton                    ===
   ===               Southampton SO9 5NH, U.K.                    ===
   ===     fax.:+44-703-593045   e-mail:icw@uk.ac.soton.ecs       ===
   ===                                  vsg@uk.ac.soton.ecs       ===
   ===                                                            ===
   ===     Copyright: SNARC, University of Southampton            ===
   ===                                                            ===
   ===          Last update: October 1991; Release: 2.0           ===
   ===                                                            ===
   ==================================================================


1. Description
--------------
This benchmark consists of solving a large, sparse system of linear
equations using conjugate gradient iteration. The equations are
derived from a lattice gauge theory simulation using dynamical Kogut-Susskind
fermions. Conjugate gradient methods form the core of several
important algorithms for lattice gauge theory with fermions.
Supercomputer performance is essential for such problems as the
inclusion of dynamical fermions increases the computational effort
required by several orders of magnitude over the 'quenched'
approximation. (The quenched approximation is used in the QCD1
benchmark.)
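The iteration itself is the standard conjugate gradient recurrence. A
minimal sketch in Python may help fix the idea (illustrative only: the
benchmark is Fortran, and its matrix is the Kogut-Susskind fermion
operator, not the toy system used here):

```python
# Minimal conjugate gradient sketch (illustrative Python, not the
# benchmark's Fortran; A is a small symmetric positive-definite matrix).

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, tol=1e-12, max_iter=100):
    x = [0.0] * len(b)
    r = b[:]                  # residual r = b - A x  (x starts at 0)
    p = r[:]                  # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:      # converged: residual norm small enough
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# small test system: exact solution is x = (1/11, 7/11)
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)
```

The dominant costs per iteration are one matrix-vector product and two
inner products; in the benchmark the inner products become the global
summation mentioned below.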

Simulations are defined on four-dimensional lattices which are
discrete approximations to continuum space-time. The basic variables
are 3 by 3 complex matrices. Four such matrices are associated with
every lattice site.

The benchmark takes the common approach of updating the variables on
all even sites, and then on all odd sites, on alternate steps. Updating a
site variable requires a number of matrix multiplications
and involves matrices from neighbouring sites. Almost all
the arithmetic operations are vectorizable. However, achieving this
vectorization incurs an overhead in internal shifts of neighbouring
matrices, which can become a significant part of the execution time.
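The even/odd update order above is the usual checkerboard scheme: a
site (t, x, y, z) is even when t+x+y+z is even, and every nearest
neighbour of an even site is odd, so each half-sweep touches only one
parity. A small sketch (illustrative Python, not the benchmark's
Fortran):

```python
# Checkerboard partition of a tiny 4-dimensional lattice.
from itertools import product

def parity(site):
    # even sites have an even coordinate sum
    return sum(site) % 2

def all_sites(Nt, Nx):
    return list(product(range(Nt), range(Nx), range(Nx), range(Nx)))

sites = all_sites(2, 2)                      # tiny 2 * 2**3 lattice
even = [s for s in sites if parity(s) == 0]  # updated on one half-sweep
odd = [s for s in sites if parity(s) == 1]   # updated on the other
```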

The parallel version of the program distributes the spatial
dimensions of the lattice over a cuboidal process grid.
Communications involve both the shifting of matrices between
neighbouring processors and a global summation followed by a broadcast.


2. Operating Instructions
-------------------------

Changing problem size and number of processors:

The lattice size is based on a 4-dimensional space-time lattice of
size:  N = Nt * Nx**3, where Nt & Nx are even integers.
The parallel version of the program is distributed over the spatial
dimensions in a cuboidal process grid of size:  P = Px * Py * Pz
The local lattice size on each processor is then: 

    n = Nt * (Nx/Px) * (Nx/Py) * (Nx/Pz)

There is no input data file. The lattice and processor grid dimensions
are specified by PARAMETER statements in an INCLUDE file (qcd2.inc). 
The values of Nt, Nx, Px, Py & Pz should be chosen such that the 
dimensions of the local lattice size are even integers. All other
parameters are set explicitly in a call to the subroutine setpar.
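The bookkeeping above can be sketched as follows (parameter names Nt,
Nx, Px, Py, Pz follow the text; in the benchmark they are PARAMETER
statements in qcd2.inc, not Python variables):

```python
# Local lattice size per processor, with the evenness constraints
# described above checked explicitly.

def local_lattice(Nt, Nx, Px, Py, Pz):
    # global lattice dimensions must be even integers
    assert Nt % 2 == 0 and Nx % 2 == 0
    # each local spatial dimension must also be an even integer
    for P in (Px, Py, Pz):
        assert Nx % P == 0 and (Nx // P) % 2 == 0
    return Nt * (Nx // Px) * (Nx // Py) * (Nx // Pz)

n = local_lattice(8, 4, 2, 2, 2)   # 8 * 2 * 2 * 2 local points
```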


Suggested Problem Sizes :

The smallest possible local lattice size is 2 * 2**3, i.e. 2**4 lattice
points per processor, though in this case the granularity is too small
to give either efficient distribution or vectorization.

A more reasonable minimum lattice size per processor is 2**6 lattice points,
i.e. a global lattice of 8 * 4**3 on an 8-processor grid, or 4 * 8**3 on
32 processors.

The likely maximum local lattice size that will fit into node memory is
around 2**11 local lattice points, i.e. a 32 * 8**3 lattice on an
8-processor grid.

It is suggested that the benchmark be run for a range of problem sizes,
from the suggested minimum to the maximum possible lattice size, over
the range of available processor counts.
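The per-processor counts quoted above are just N / P. A quick check
(Python sketch, names taken from the text):

```python
# Lattice points per processor for a global lattice N = Nt * Nx**3
# distributed over P processors.

def points_per_proc(Nt, Nx, P):
    return Nt * Nx**3 // P

minimum = points_per_proc(8, 4, 8)    # 8 * 4**3 over 8 processors
maximum = points_per_proc(32, 8, 8)   # 32 * 8**3 over 8 processors
```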


Compiling and Running the Benchmark:

1) Choose the problem size and number of processors, then edit the
   include file qcd2.inc to set the appropriate parameters.

2) To expand the PARMACS macros, compile and link the code
   with the appropriate libraries, type:    make

3) If any of the parameters in the include file are changed,
   the code has to be recompiled. The makefile will automatically
   recompile only the affected files; type:    make

4) On some systems it may be necessary to allocate the appropriate
   resources before running the benchmark, e.g. on the iPSC/860,
   to reserve a cube of 8 processors, type:    getcube -t8

5a) To run the sequential benchmark, type:    qcd2

5b) To run the distributed benchmark, type:   host

   This will automatically load both host and node programs. 

   The progress of the benchmark execution can be monitored via
   the standard output, whilst a permanent copy of the benchmark output
   is written to a file called 'result'.

6) If the run is successful and a permanent record is required, the
   file 'result' should be copied to another file before the next run
   overwrites it.


3. Hints for Optimisation (Blockshift versus indirect addressing)
-----------------------------------------------------------------
Two routines are provided for the shift operation, blockshift and
shiftvec. Blockshift shifts coherent blocks corresponding to a given
lattice direction. However, the block length is rather small in the
t and x directions, so vector start-up costs dominate and the vector
efficiency is poor. In these directions it is therefore more efficient
to use the indirect-addressing version of the shift routine, shiftvec.

The shift routines are called from the routine dvec, which by default
uses shiftvec in the t and x directions and blockshift in the y and z
directions. For best performance with smaller lattices, blockshift should
be used in the y direction.
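The trade-off between the two strategies can be sketched schematically
(illustrative Python; the benchmark's routines are Fortran and act on
the SU(3) matrix fields, and the stride and index values here are made
up). Blockshift copies one contiguous run per block, cheap when blocks
are long but start-up dominated when they are short; shiftvec gathers
through a precomputed index vector in one long, vectorizable loop:

```python
# Two ways to realise the same shift of lattice data.

def blockshift(src, block_len, nblocks, stride):
    out = []
    for b in range(nblocks):
        base = b * stride
        out.extend(src[base:base + block_len])  # contiguous copy per block
    return out

def shiftvec(src, index):
    return [src[i] for i in index]              # single indirect gather

data = list(range(8))
a = blockshift(data, 2, 2, 4)     # copies blocks [0:2] and [4:6]
b = shiftvec(data, [0, 1, 4, 5])  # same elements via an index vector
```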



4. Accuracy Check
-----------------
The output results are best characterised by the total energy per
lattice point (output in column 3). The program can be considered to
have run successfully if the following two conditions are met:

1) The total energy should be constant to 5 decimal places for each
   iteration (a small variation in the sixth decimal place is allowable).

2) This constant value should be close to 3.0.

Unfortunately it is difficult to be more precise, as the fermion and gauge
fields are initialised by a random number generator. Consequently the exact
value of the total energy depends on the number of processors and the
problem size.
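The two conditions above could be checked mechanically along these
lines (a sketch, assuming each line of 'result' holds whitespace-
separated columns with the energy per lattice point in column 3; the
exact file layout and the "close to 3.0" tolerance are assumptions):

```python
# Acceptance check: energy per lattice point (column 3) constant to
# 5 decimal places across iterations, and near the expected value 3.0.

def check_result(lines, target=3.0, tol=1e-5):
    energies = [float(ln.split()[2]) for ln in lines if ln.strip()]
    first = energies[0]
    constant = all(abs(e - first) < tol for e in energies)
    close = abs(first - target) < 0.1   # tolerance assumed, not specified
    return constant and close

ok = check_result(["1 0.1 3.000001", "2 0.1 3.000002"])
```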
