
  SPhot - Monte Carlo Transport Code


    Benchmark-specific Instructions and Constraints

*Privacy & Legal Notice* <http://www.llnl.gov/disclaimer.html>
------------------------------------------------------------------------


    Contents

    * Expected Results 
    * Optimization and Improvement Challenges 
    * Parallelism and Scalability Expectations 
    * Benchmark Specific Instructions 
    * Release and Modification Record 

------------------------------------------------------------------------


    Expected Results

       Please see the Summary writeup and the Sphot Sequoia Specific Benchmark
       tests for information about expected results and the reporting of data.


------------------------------------------------------------------------


    Optimization and Improvement Challenges

Performance improvement and optimizations for this code fall into the
following categories:

   1. *MPI Related*

      Very little data is actually transferred between MPI tasks, since
      each task has its own copy of the mesh, most likely in memory or
      even cache. Most of the inter-task data transfer is for the
      purpose of collecting timing results and involves very little
      data. Several MPI_Barrier calls are also used for
      synchronization.

      Improving MPI-related performance would involve minimizing or
      removing the MPI_Barrier calls and finding a way to minimize the
      exchange of inter-task timing statistics. With regard to the
      exchange of timing statistics, task 0 is a definite bottleneck
      since it must act as the "master" task and communicate with all
      other tasks. This bottleneck may pose a challenge as the problem
      is scaled up to hundreds or thousands of MPI tasks.

      A very important, platform-dependent optimization consideration
      for this code is determining the optimal number of MPI tasks to
      use. For example, a cluster of 32-processor SMP machines may
      demonstrate optimal performance when there are 4 MPI tasks
      (running 8 OpenMP threads each) on each SMP machine. The best
      configuration is probably found by experimentation on a given
      platform.

   2. *OpenMP Related*

      The current implementation does not make use of THREADPRIVATE
      common blocks. Instead, all threads access certain COMMON block
      variables globally, and that access occurs within the
      computational core routine (execute.f) of the code. For SMPs with
      few processors, this does not seem to pose a performance problem.
      However, it is anticipated that for many-processor SMP machines,
      it may. Improving OpenMP performance would involve finding a means
      to implement THREADPRIVATE common blocks. Currently, the code
      produces "wrong" results if this is attempted.

      As with MPI, determining the optimal number of OpenMP threads to
      use is a very important platform dependent optimization
      consideration.

   3. *Computation Kernel Related*

      A single "run" of SPhot requires very little wallclock or CPU
      time, certainly much less than a minute on any current platform.
      Also, due to the small problem size (grid size) and the
      embarrassingly parallel nature of this code, performance gains
      in this category would be minimal at best.

      For those interested in improving the performance of the
      computationally significant routines, the following might be
      evaluated:

          * execute.f

            This routine is responsible for over 50% of the total
            execution time. Profiling has shown that most of the cycles
            are distributed over a fairly large number of lines that
            perform simple arithmetic operations (multiplies, adds,
            divides), variable assignments (load/store), and IF
            condition testing.

          * pranf.f

            This routine comprises approximately 10% of the total
            execution time. It is a very small routine where virtually
            all of the execution time can be accounted for by a single
            line of code, shown below:


                  RandNum = float( Seed( 4 ) ) / Divisor( 4 ) +
                 1     float( Seed( 3 ) ) / Divisor( 3 ) +
                 2     float( Seed( 2 ) ) / Divisor( 2 ) +
                 3     float( Seed( 1 ) ) / Divisor( 1 )

          * ranfmodmult.f

            This routine comprises approximately 10% of the total
            execution time. This is another rather trivial (in size)
            routine. The execution time is more or less evenly
            distributed over all lines of executable code (less than 20)
            which mostly perform simple arithmetic and load/store
            operations.

          * sqrt intrinsic function

            This will vary according to platform. On one platform tested
            using the native Fortran compiler, approximately 10% of the
            total execution time was attributed to this intrinsic function.
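
As a starting point for the experimentation suggested under *MPI Related*
above, the following minimal sketch lists every way to split one SMP
node's processors between MPI tasks and OpenMP threads. It is written in
Python purely for illustration. The function name node_layouts and the
assumption that tasks times threads should exactly fill the node are
hypothetical and not part of SPhot.

```python
def node_layouts(cpus_per_node):
    """Return every (mpi_tasks, omp_threads) pair whose product equals cpus_per_node.

    Assumption (not from SPhot): each node is fully populated, with no
    CPUs left idle and no oversubscription.
    """
    if cpus_per_node < 1:
        raise ValueError("cpus_per_node must be at least 1")
    return [(tasks, cpus_per_node // tasks)
            for tasks in range(1, cpus_per_node + 1)
            if cpus_per_node % tasks == 0]


# Candidate layouts for the 32-processor SMP example in the text.
for tasks, threads in node_layouts(32):
    print(f"{tasks:2d} MPI tasks x {threads:2d} OpenMP threads per node")
```

The 4 tasks by 8 threads layout mentioned above is one of the six
candidates listed for a 32-processor node. Which one performs best still
has to be measured on the target platform.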

------------------------------------------------------------------------


    Parallelism and Scalability Expectations

Because SPhot is trivially parallel, it might be expected that this code
should scale "perfectly" to very large processor counts. Additionally,
I/O operations have been replaced with MPI communications to eliminate
scalability problems common to many I/O systems.


When SPhot is executed in hybrid mode (MPI with OpenMP), the performance
of both the shared memory and the distributed memory communications
hardware and software is exercised. CPU time per "run" should remain
relatively constant, since runs are intended to map onto the available
CPUs. The number of "runs" that can be done in parallel increases with
the number of available CPUs. Perfect scalability is evidenced by an
efficiency calculation of 1.00. That is, no additional time is required
to compute N runs on N CPUs.
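
The efficiency figure can be illustrated with a small sketch. It assumes
that efficiency is the time for one run on one CPU divided by the time
for N runs on N CPUs. That matches the description above, but it is an
assumption; the Summary writeup remains the authority on how reported
figures are calculated. The function name is hypothetical.

```python
def scaling_efficiency(time_one_run_one_cpu, time_n_runs_n_cpus):
    """Efficiency of computing N runs on N CPUs relative to 1 run on 1 CPU.

    Assumed definition: a value of 1.00 means the N-run case took no
    longer than the single-run case. Both times must use the same unit.
    """
    if time_one_run_one_cpu <= 0 or time_n_runs_n_cpus <= 0:
        raise ValueError("times must be positive")
    return time_one_run_one_cpu / time_n_runs_n_cpus


# Made-up timings for illustration only: 2.0 s for one run on one CPU,
# 2.5 s for N runs on N CPUs.
print(f"efficiency = {scaling_efficiency(2.0, 2.5):.2f}")
```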

Realistically, the scalability of SPhot may be affected by the MPI and
OpenMP related factors discussed in the Optimization and Improvement
Challenges section. As indicated there, the application's MPI
communications include barriers and potential task 0 bottlenecks. In
addition, the OpenMP sharing of global COMMON block data within the
computation loop may degrade performance on many-processor SMP machines.
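
To give a feel for why the task 0 bottleneck grows with scale, the
sketch below compares two idealized collection patterns. In the first,
every task reports directly to task 0, which handles the messages one
after another. In the second, tasks combine results pairwise in a binary
tree. The tree pattern is a general technique named here for comparison
only. The sketch does not describe how SPhot collects its statistics,
and both function names are hypothetical.

```python
def master_receive_steps(num_tasks):
    """Messages task 0 must handle in turn if all other tasks report to it."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    return num_tasks - 1


def tree_combine_rounds(num_tasks):
    """Rounds needed when tasks merge results pairwise in a binary tree.

    Equals ceil(log2(num_tasks)); computed with integer arithmetic to
    avoid floating-point rounding.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    return (num_tasks - 1).bit_length()


for n in (16, 1024, 16384):
    print(f"{n:6d} tasks: {master_receive_steps(n):6d} sequential receives "
          f"vs {tree_combine_rounds(n):2d} tree rounds")
```

The counts ignore message size and network topology, so they indicate
only how each pattern grows with the number of MPI tasks.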

------------------------------------------------------------------------


    Benchmark Specific Instructions and Constraints

   1. *Code Modifications*

      For the purposes of the ASC benchmarks, modifications to this
      application's source code are not permitted unless such
      modifications are:
          * Needed to modify certain default parameters as discussed in
            the Building the Code section.
          * Minor and not intended for optimization purposes. Permitted
            modifications would include those required to overcome
            platform specific obstacles to building or running the code.
            These modifications should be documented and reported to the
            ASCI benchmark point of contact. 

   2. *Modifications to Input File Parameters*

      As discussed in the Running the Code section, the *Nruns*
      parameter will almost invariably need to be modified every time
      the number of MPI tasks / OpenMP threads used in an execution
      changes. This input parameter and the output print flag are the
      only two parameters which can be modified.

      Note that Nruns must be at least equal to, and evenly divisible
      by, the product (Number of MPI tasks * Number of OpenMP threads)
      used. It has a default maximum value of 10001. If this default
      maximum is too small, consult the Building the Code section for
      instructions on how to increase it.

   3. *Compiler Generated Optimizations*

      Benchmarkers are encouraged to use compiler generated
      optimizations when building the code. Specifying these
      optimizations should not require modification to the source
      code, but instead should take the form of common compiler
      "flags". These optimizations may be specified in the two
      Makefiles used for that purpose. See Building the Code for
      additional information.

   4. *Maximum Number of MPI Tasks and OpenMP Threads*

      For the purpose of array dimensioning, particularly for the
      collection of timing statistics, maximum values for these two
      parameters are defined in the file includes/times.inc. The
      distribution software sets these as follows:

      PARAMETER( maxMPItasks = 16384 )
      PARAMETER( maxThreadsPerMPItask = 128 )

      If these values need to be increased, see the Building the Code
      section for instructions.

   5. *Required Problems*

     Please see the Sphot Sequoia Specific Problem set for the problems
     to be run, the results to be reported, and the description of the
     calculation of the Figure of Merit (FOM) for the runs.
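
The constraints in items 2 and 4 above can be checked before a job is
submitted. The following is a minimal sketch of such a check, using the
distribution defaults quoted above (10001, 16384, and 128). The function
and constant names are hypothetical, and the limits must be adjusted if
they have been changed as described in the Building the Code section.

```python
DEFAULT_MAX_NRUNS = 10001        # default maximum value of Nruns
MAX_MPI_TASKS = 16384            # maxMPItasks in includes/times.inc
MAX_THREADS_PER_MPI_TASK = 128   # maxThreadsPerMPItask in includes/times.inc


def check_nruns(nruns, mpi_tasks, omp_threads):
    """Return a list of constraint violations; an empty list means OK."""
    if mpi_tasks < 1 or omp_threads < 1:
        return ["MPI task and OpenMP thread counts must be at least 1"]
    problems = []
    workers = mpi_tasks * omp_threads
    if mpi_tasks > MAX_MPI_TASKS:
        problems.append(f"{mpi_tasks} MPI tasks exceeds {MAX_MPI_TASKS}")
    if omp_threads > MAX_THREADS_PER_MPI_TASK:
        problems.append(
            f"{omp_threads} threads exceeds {MAX_THREADS_PER_MPI_TASK}")
    if nruns < workers:
        problems.append(f"Nruns={nruns} is less than tasks*threads={workers}")
    elif nruns % workers != 0:
        problems.append(
            f"Nruns={nruns} is not evenly divisible by {workers}")
    if nruns > DEFAULT_MAX_NRUNS:
        problems.append(
            f"Nruns={nruns} exceeds the default maximum {DEFAULT_MAX_NRUNS}")
    return problems


# Example: 4 MPI tasks with 8 OpenMP threads each, so Nruns must be a
# multiple of 32.
for nruns in (64, 100):
    issues = check_nruns(nruns, mpi_tasks=4, omp_threads=8)
    print(nruns, "OK" if not issues else issues)
```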
------------------------------------------------------------------------


    Release and Modification Record

Version 1.0, this version. No release required. Public domain software.
------------------------------------------------------------------------

For more information about this page, contact:
Tom Spelce, spelce1@llnl.gov <mailto:spelce1@llnl.gov>

<http://www.llnl.gov/>

*UCRL-MI-144211*
September 19, 2001

