# Parallelism and HPC in Julia¶

Julia's documentation: http://docs.julialang.org/en/release-0.5/manual/parallel-computing/

In this notebook we will go over the different aspects of parallelism present in Julia.

## SIMD¶

SIMD, Single Instruction Multiple Data, is a form of parallelism that executes multiple similar operations at once using specialized processor instructions. Julia applies SIMD automatically to loops and some functions due to the -O3 optimization in its JIT compilation. However, SIMD can be explicitly requested on a loop with the @simd macro (note that this may slow down the calculation, so it is usually wise to let the auto-optimizer apply SIMD). Finer control of SIMD can be achieved using the SIMD.jl library.
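For example, a simple reduction loop annotated with @simd (a minimal sketch; @inbounds removes bounds checks, which would otherwise block vectorization):

In [ ]:
function simd_sum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    s
end

simd_sum(rand(10^6))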

Another form of fused instruction is the fused multiply-add for calculations of the form a*b+c. Julia provides two versions. The first, muladd, is the recommended form for performance: muladd(a,b,c) only applies a fused multiply-add when it helps performance. On the other hand, fma(a,b,c) always applies a fused multiply-add.
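For example:

In [ ]:
a, b, c = 2.0, 3.0, 1.0
muladd(a, b, c)   # a*b + c, fused only when it is fast on the hardware
fma(a, b, c)      # a*b + c computed with a single rounding, always fused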

As of Julia v0.5, experimental multithreading is native to Julia via the Threads.@threads macro.
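A minimal example (set the JULIA_NUM_THREADS environment variable before starting Julia to get more than one thread):

In [ ]:
a = zeros(10)
Threads.@threads for i = 1:10
    a[i] = Threads.threadid()   # record which thread handled each iteration
end
a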

## Distributed Parallelism¶

Julia's native parallelism is a distributed form of parallelism: worker processes communicate over TCP/IP and are typically launched via SSH (or locally with addprocs).
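A minimal local sketch of the core primitives (the same calls work across machines when workers are launched with --machinefile):

In [ ]:
addprocs(2)              # launch two local worker processes
r = @spawnat 2 rand(3)   # run rand(3) on worker 2; returns a Future
fetch(r)                 # pull the result back to the master process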

### Libraries¶

The following libraries are helpful for solving parallel problems:

• DistributedArrays.jl
• ParallelDataTransfer.jl
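As a small illustration of the first package (a sketch assuming DistributedArrays.jl is installed and workers have been added):

In [ ]:
addprocs(2)
@everywhere using DistributedArrays
A = rand(8, 8)
dA = distribute(A)   # split A across the workers
dA[1, 1]             # indexing fetches from whichever worker owns the block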

# Projects¶

## Project 1: Getting Started with Distributed Parallelism¶

Use the following tutorial to test Julia's distributed parallelism:

http://www.stochasticlifestyle.com/multi-node-parallelism-in-julia-on-an-hpc/

## Project 2: Coding a Distributed Algorithm¶

Extend your least_squares implementation to a distributed algorithm.

• Now generate a much larger X and y
• Use DistributedArrays.jl or ParallelDataTransfer.jl to evenly split the data amongst worker processes
• Apply the @spawnat macro to use the least_squares function on the remote processes
• Retrieve the results of the least_squares algorithm, and average them together
• Now try a different approach using pmap (a hedged sketch of both approaches follows this list)
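Below is a rough sketch of both approaches, not a full solution. It assumes the least_squares(X, y) function from the earlier project (stubbed here as X\y), local workers, and that the number of rows divides evenly by the number of workers:

In [ ]:
addprocs(4)
@everywhere least_squares(X, y) = X\y   # stand-in for your earlier implementation

n, p = 10^4, 10
X = rand(n, p); y = X*rand(p) + 0.1*randn(n)

# Split the row indices evenly among the workers.
len = div(n, nworkers())
chunks = [((i-1)*len+1):(i*len) for i in 1:nworkers()]

# Approach 1: @spawnat sends each worker its slice; fetch retrieves the fits.
futures = [@spawnat w least_squares(X[chunks[i],:], y[chunks[i]])
           for (i, w) in enumerate(workers())]
beta_avg = mean(map(fetch, futures))    # average the per-chunk estimates

# Approach 2: pmap expresses the same computation as a parallel map.
beta_avg2 = mean(pmap(c -> least_squares(X[c,:], y[c]), chunks))

Note that both versions re-send the slices on every call; distributing X and y once with DistributedArrays.jl or ParallelDataTransfer.jl avoids that overhead.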

### Benchmarks¶

• Benchmark the two implementations on your computer (or cluster!). How does the performance scale with the number of processes? See http://www.stochasticlifestyle.com/236-2/
• Try to make a multithreaded version of the algorithm. How well does it benchmark? Check for type instabilities!
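Type instabilities can be inspected with @code_warntype; for example, using the stand-in least_squares from the sketch above:

In [ ]:
@code_warntype least_squares(rand(100, 10), rand(100))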

# Extras¶

## Job Scripts¶

### UC Irvine Cluster (SGE)¶

In [ ]:
#!/bin/bash

#$ -N jbtest
#$ -q <Queue>
#$ -pe mpich 128
#$ -cwd                 # run the job out of the current directory
#$ -m beas
#$ -ckpt blcr
#$ -o output/
#$ -e output/
julia --machinefile jbtest-pe_hostfile_mpich.$JOB_ID test.jl

### XSEDE Comet (Slurm) Job Script¶

In [ ]:
#!/bin/bash
#SBATCH -A <account>
#SBATCH --job-name="juliaTest"
#SBATCH --output="juliaTest.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=8
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=24
#SBATCH -t 01:00:00
export SLURM_NODEFILE=`generate_pbs_nodefile`
./julia --machinefile $SLURM_NODEFILE /home/crackauc/test.jl

## Script which prints out the hostnames of the worker processes¶

In [ ]:
# Ask each worker to report its hostname (the loop body runs on the workers).
@parallel for i = 1:nworkers()
    run(`hostname`)
end