   =================================================
   ===                                           ===
   ===   GENESIS Distributed Memory Benchmarks   ===
   ===                                           ===
   ===                   LPM1                    ===
   ===                                           ===
   ===   Program: local particle mesh benchmark  ===
   ===   Version:   PARMACS Fortran 77           ===
   ===   Author:    Roger W. Hockney             ===
   ===   Update:    March 1992; Release: 2.0     ===
   ===                                           ===
   =================================================

    This run started 13-May-92 00:08:35
    Run on iPSC/860 at Daresbury
    Compiler version :iPSC_FORTRAN i860 Rel. 3.0
    Operating system :iPSC System Software Rel. 3.3.2
    Benchmarker was Vladimir Getov

 this benchmark is the simulation of an electronic device
 using a particle-mesh (pm) method, often also called a
 particle-in-cell (pic) simulation. in each timestep the
 electric and magnetic fields on an (lmax x mmax) mesh are
 advanced explicitly in time using maxwell's equations, and
 the particles (electrons) are advanced in the fields using
 newton's equations. 

 the benchmark is described as local because the time scale
 is such that the fields may be computed explicitly, using
 fields only local to each mesh point. four benchmark cases
 are provided (nben3=1,2,3,4), giving four problem sizes
 described by the size factor alpha=1,2,4,8 and mesh numbers
 (75*alpha,33). the number of particles at the end of the
 run of 1 nanosecond is given empirically by
                  628*alpha**1.172.  
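
 as an illustrative sketch (not part of the benchmark code),
 the four problem sizes can be tabulated in python. the name
 problem_size is hypothetical, and the mapping alpha=2**(nben3-1)
 is inferred from the list alpha=1,2,4,8; the constants 75, 33,
 628 and 1.172 are taken from the text above.

```python
def problem_size(nben3):
    """Return (alpha, lmax, mmax, estimated final particle count) for case nben3."""
    if nben3 not in (1, 2, 3, 4):
        raise ValueError("nben3 must be 1, 2, 3 or 4")
    alpha = 2 ** (nben3 - 1)           # size factor: 1, 2, 4, 8
    lmax, mmax = 75 * alpha, 33        # mesh numbers (75*alpha, 33)
    np_final = 628 * alpha ** 1.172    # empirical particle count at end of run
    return alpha, lmax, mmax, np_final


for case in (1, 2, 3, 4):
    alpha, lmax, mmax, np_final = problem_size(case)
    print(f"nben3={case} alpha={alpha} mesh=({lmax},{mmax}) particles~{np_final:.0f}")
```

 because the exponent 1.172 exceeds one, the particle count
 grows slightly faster than the number of mesh points.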

 as the number of mesh-points increases for the same physical
 dimension, the time-step must be reduced to satisfy the cfl
 stability criterion.  this effect has an important influence
 on the meaning of the performance metrics. the performance
 is expressed in several different metrics (and units) for
 comparison purposes.  as well as the traditional speedup and
 efficiency, we give the temporal (tstep/s), simulation
 (sim-ps/s), and benchmark (mflop/s(lpm1)) performance, which
 are much more meaningful and useful measures.

 parallelisation is by one-dimensional domain decomposition,
 in the first coordinate. each processor is responsible for
 a slab of space, and stores the mesh-points and coordinates
 of particles in its region of space. during each timestep
 particle coordinates are transferred between processors as
 the particles move from region to region.
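
 a minimal python sketch of this slab decomposition follows.
 it is an assumption, not the benchmark's own scheme, that the
 mesh columns are split into contiguous, near-equal slabs; the
 names slab_bounds and owner are hypothetical. a particle that
 crosses a slab boundary changes owner and would have to be
 transferred.

```python
def slab_bounds(lmax, p):
    """Split mesh columns 0..lmax-1 into p contiguous slabs (assumed near-equal)."""
    base, extra = divmod(lmax, p)
    bounds, start = [], 0
    for rank in range(p):
        width = base + (1 if rank < extra else 0)
        bounds.append((start, start + width))   # half-open range [start, end)
        start += width
    return bounds


def owner(l, bounds):
    """Return the rank whose slab contains mesh column l."""
    for rank, (lo, hi) in enumerate(bounds):
        if lo <= l < hi:
            return rank
    raise ValueError("column outside the mesh")


bounds = slab_bounds(75, 3)          # lmax = 75, p = 3 as in this run
old_rank = owner(24, bounds)         # particle in column 24 ...
new_rank = owner(25, bounds)         # ... moves to column 25
if new_rank != old_rank:
    print(f"transfer particle from processor {old_rank} to {new_rank}")
```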

          basic run parameters
          ====================

 case number,          nben3 =   1
 problem size factor,  alpha =   1
     mesh points in z,  lmax =  75
     mesh points in r,  mmax =  33
 number of processors,   p   =   3
 number of timesteps,   nrun = 104

          ====================

 error check
 -----------
 because the simulation uses random numbers, the multi-processor
 calculation cannot be expected to give identical results to the
 uni-processor calculation. however, the percentage difference
 in particle number, np, and average b-field, bav,  in the last
 timestep, should not exceed a few percent:

     number particles,   np  =      628 (1-proc)      627 (p-proc)
     average b-field,   bav  = 1.97E-02 (1-proc) 1.98E-02 (p-proc)

     % difference np =  -0.159      % difference bav =   0.622
     calculations are accepted if differences < 10%
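
 the check can be sketched in python as below. the sign
 convention (p-proc minus 1-proc, relative to 1-proc) is an
 assumption chosen to match the printed values, as is testing
 the magnitude of each difference; the function names are
 hypothetical. the bav difference is taken as printed, since
 it is evidently computed from unrounded field values.

```python
def percent_difference(one_proc, p_proc):
    """Percentage difference of the p-processor value relative to the 1-processor value."""
    return 100.0 * (p_proc - one_proc) / one_proc


def acceptable(diffs, tolerance=10.0):
    """Accept the run if every difference is below the tolerance in magnitude."""
    return all(abs(d) < tolerance for d in diffs)


d_np = percent_difference(628, 627)      # particle numbers from this run
d_bav = 0.622                            # printed bav difference, used as-is
print(f"% difference np = {d_np:.3f}")
print("acceptable" if acceptable([d_np, d_bav]) else "not acceptable")
```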


 $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

        benchmark calculation acceptable
            you may use the results

 $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$


 temporal performance
 --------------------
 temporal performance is the inverse of the execution time
 per timestep, here expressed in timesteps per second (tstep/s).
 this is the fundamental metric of performance, because it
 is in absolute units and one can guarantee that the code with
 the highest temporal performance executes in the least time:

  uniprocessor   time,          t1p =     14.9453 s
  multiprocessor time,          tnp =     12.0742 s
                                -------------
  temporal perf. (1-proc) = nrun/t1p =     6.9587 tstep/s
  temporal perf. (n-proc) = nrun/tnp =     8.6134 tstep/s
                              ratio =      1.2378
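
 as a check, the figures above are reproduced when the number
 of timesteps, nrun, is divided by the execution time. the
 short python sketch below uses only values printed in this
 report; the function name is hypothetical.

```python
nrun = 104                      # number of timesteps in this run
t1p, tnp = 14.9453, 12.0742     # execution times in seconds


def temporal_performance(nrun, seconds):
    """Timesteps completed per wall-clock second."""
    return nrun / seconds


r1 = temporal_performance(nrun, t1p)
rn = temporal_performance(nrun, tnp)
print(f"1-proc {r1:.4f} tstep/s, n-proc {rn:.4f} tstep/s, ratio {rn / r1:.4f}")
```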


 speedup and efficiency
 ----------------------
 speedup, sp, has the traditional definition of the ratio of
 1-proc to n-proc execution time, and efficiency, ep, is
 speedup per processor. because speedup is a relative
 measure, the program with the highest speedup may not
 execute in the least time! be warned.

   uniprocessor   time,  t1p =      14.945 s
   multiprocessor time,  tnp =      12.074 s
                             -------------
    speedup,    sp = t1p/tnp =       1.238
    efficiency, ep = sp/p    =      41.260%
                             -------------
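
 the warning about relative measures can be made concrete
 with a small python sketch. the first part uses the times
 from this run; code_a and code_b are invented, hypothetical
 programs whose timings are chosen purely for illustration.

```python
def speedup(t1, tn):
    """Ratio of 1-processor to n-processor execution time."""
    return t1 / tn


def efficiency(t1, tn, p):
    """Speedup per processor."""
    return speedup(t1, tn) / p


# values from this run (p = 3)
sp = speedup(14.945, 12.074)
ep = efficiency(14.945, 12.074, 3)
print(f"sp = {sp:.3f}, ep = {100 * ep:.2f}%")

# hypothetical codes: a has the higher speedup, b the shorter run time
code_a = {"t1": 100.0, "tn": 20.0}
code_b = {"t1": 30.0, "tn": 15.0}
for name, code in (("a", code_a), ("b", code_b)):
    print(f"code {name}: speedup {speedup(**code):.1f}, time {code['tn']:.0f} s")
```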

 simulation performance
 ----------------------
 this metric measures the amount of simulated time computed
 in one real wall-clock second. it is the most meaningful
 metric for a simulation because it is what the user actually
 wishes to maximise. for this benchmark, the units are 
 simulated picosecond per second (sim-ps/s). in this metric
 larger problems with more mesh points run slower (which in
 fact they do), although they generate more speedup and
 mflop/s! this metric also reflects the fact that problems with
 a smaller space step often must use a smaller timestep,
 and therefore take more timesteps to cover the same amount
 of simulated time.

    timestep,   dt =       9.558 ps
    simulated time  =  nrun*dt =   994.070 sim-ps;  requested =   1.000 ns
    simulation performance(1-proc) =      66.514 sim-ps/s
    simulation performance(n-proc) =      82.330 sim-ps/s
                             ratio =       1.238
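
 the same numbers follow from dividing the simulated time by
 the wall-clock time, as the python sketch below shows. it
 also recovers the timestep as simulated time over nrun; this
 agrees with the printed dt only to the digits shown. the
 function name is hypothetical.

```python
nrun = 104                      # number of timesteps
sim_time_ps = 994.070           # simulated time reported above, in ps
t1p, tnp = 14.9453, 12.0742     # execution times in seconds


def simulation_performance(sim_ps, seconds):
    """Simulated picoseconds advanced per wall-clock second."""
    return sim_ps / seconds


dt = sim_time_ps / nrun         # timestep implied by the reported totals
s1 = simulation_performance(sim_time_ps, t1p)
sn = simulation_performance(sim_time_ps, tnp)
print(f"dt = {dt:.3f} ps, 1-proc {s1:.3f}, n-proc {sn:.3f} sim-ps/s")
```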


 benchmark performance
 ---------------------
 this metric is calculated from the nominal number of
 floating-point operations needed to perform the benchmark
 on a single processor.  for the one-nanosecond benchmark
 setup here, the average number of floating-point operations
 per timestep is defined to be:
         f_b(alpha) = 46*75*33*alpha + 58*628*alpha**1.172
 where the size factor alpha=1,2,4,8 for cases nben3=1,2,3,4.
 the first term above is the work to update the fields on the
 mesh, and the second term is the work to move the particles.
 then the benchmark performance is
         r_b(alpha,p) = f_b(alpha)/tp(alpha,p)
 where tp(alpha,p) is the execution time per timestep on p
 processors. performance calculated in this way has the units
 mflop/s(lpm1). different parallel implementations may,
 in fact, perform more or fewer operations than the above, but
 they are only credited with the number given by the formula.
 because f_b is fixed for all codes, we can guarantee that the
 code with the highest benchmark performance executes in the
 least time.

    floating-point operations per timestep:
       mesh = 0.114E+06  particles = 0.364E+05  total = 0.150E+06 flop
    floating-point operations per second (all steps):
       benchmark performance(1-proc) =       1.046 mflop/s(lpm1)
       benchmark performance(n-proc) =       1.294 mflop/s(lpm1)
                               ratio =       1.238
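
 the operation counts and rates above can be reproduced with
 the python sketch below. it assumes that the time in the
 formula is the execution time per timestep (total time over
 nrun), which is what matches the printed values; the function
 names are hypothetical.

```python
def flop_per_timestep(alpha):
    """Nominal (mesh, particle) floating-point operations per timestep."""
    mesh = 46 * 75 * 33 * alpha                # field update on the mesh
    particles = 58 * 628 * alpha ** 1.172      # particle move
    return mesh, particles


def benchmark_performance(alpha, nrun, seconds):
    """Nominal mflop/s(lpm1) for a run of nrun timesteps taking `seconds`."""
    mesh, particles = flop_per_timestep(alpha)
    time_per_step = seconds / nrun
    return (mesh + particles) / time_per_step / 1.0e6


mesh, particles = flop_per_timestep(1)
b1 = benchmark_performance(1, 104, 14.9453)
bn = benchmark_performance(1, 104, 12.0742)
print(f"mesh = {mesh}, particles = {particles:.0f}")
print(f"1-proc {b1:.3f}, n-proc {bn:.3f} mflop/s(lpm1)")
```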


 **********************************************************
 *                  performance summary                   *
 **********************************************************
 *             particle-mesh (pic) simulation             *
 *       of one nanosecond of electronic device time      *
 *         parallelised by 1d domain decomposition        *
 *--------------------------------------------------------*
 *                                                        *
 *                     n-proc      1-proc                 *
 *   elap. time :      12.074      14.945 s               *
 *   numb. step :         104 tstep                       *
 *     temporal :       8.613       6.959 tstep/s         *
 *     speedup  :       1.238                             *
 *   efficiency :      41.260 %                           *
 *   simulation :      82.330      66.514 sim-ps/s        *
 *   benchmark  :       1.294       1.046 mflop/s(lpm1)   *
 *                                                        *
 *--------------------------------------------------------*
 *   tstep/s    -  timestep per second                    *
 *   sim-ps/s   -  simulated picosec per second           *
 *   mflop/s    -  10**6 floating-point op. per second    *
 **********************************************************


