Parallelise Relate


Relate has been coded such that it can be parallelised on hundreds of cores on large computing clusters. We provide a script for running Relate on a SGE cluster here. For smaller datasets, where it is sufficient to parallelise over only a few cores, we provide a bash script that parallelises Relate without having to submit many jobs.

Parallelised Relate

We provide a bash script that parallelises Relate on multiple cores. The script can be found under

PATH_TO_RELATE/scripts/RelateParallel/RelateParallel.sh
The script takes the same arguments (and same optional arguments) as Relate, and one additional argument:
--threads
Maximum number of threads. Default is 1. Note that the script only parallelises Relate if the input data set is sufficiently large. Otherwise, it is more efficient to run Relate on a single core, and Relate will automatically do so. If you set --max_memory to a smaller value, Relate will tend to parallelise over more cores.
Do not move the script - we are extracting the path to the binaries from the location of the script!

 PATH_TO_RELATE/scripts/RelateParallel/RelateParallel.sh\
                  -m 1.25e-8 \
                  -N 30000 \
                  --haps data/example.haps \
                  --sample data/example.sample \
                  --map data/genetic_map.txt \
                  --annot data/example.annot \
                  --seed 1 \
                  -o example \
                  --threads 8 
Output: example.anc, example.mut

Relate on a SGE cluster

We provide a bash script for running Relate on a Sun Grid Engine (SGE) cluster. The script can be found under

PATH_TO_RELATE/scripts/RelateSGE/RelateSGE.sh
The script takes the same arguments (and same optional arguments) as Relate, and some additional arguments that can be directly mapped to the corresponding SGE arguments:
-P
Assign job to the project.
-q
Queue.
--pe
Optional. Processor count per node. Default is 'shmem 1'.
Do not move the script - we are extracting the path to the binaries from the location of the script!

 PATH_TO_RELATE/scripts/RelateSGE/RelateSGE.sh\
                  -m 1.25e-8 \
                  -N 30000 \
                  --haps data/example.haps \
                  --sample data/example.sample \
                  --map data/genetic_map.txt \
                  --annot data/example.annot \
                  --seed 1 \
                  -o example \
                  -P myers.prjc \
                  -q short.qc
Output: example.anc, example.mut

Stages of Relate


We implemented the algorithm such that it can be parallelised in multiple ways and, importantly, we can parallelise by submitting hundreds of independent jobs on a computing cluster.
To parallelise Relate, do not use mode All. Instead we have to execute each of the 7 stages in sequence:
  1. MakeChunks: Divides the chromosome into chunks. To each chunk, we apply stages 2-6.
  2. Paint: Applies the chromosome painting algorithm and stores distance matrices as temporary files.
  3. BuildTopology: Uses distance matrices calculated in the previous stage and estimates tree topologies.
  4. FindEquivalentBranches: Looks for equivalent branches in adjacent trees.
  5. InferBranchLengths: Estimates branch lengths for the trees.
  6. CombineSections: Combines files for the chunk.
  7. Finalize: Combines all chunks and generates the output.
Let us define the following parameters:
  • output [string]: Filename of output without file extension.
  • mu [dbl]: mutation rate.
  • Ne [dbl]: effective population size of haplotypes. (NOT of individuals! To get the population size of haplotypes, multiply the effective population size of individuals by 2.)

1. Make chunks

  • Argument --haps, --sample, and --map take the filenames of the input data.
  • Optional argument --dist takes the filename of a .dist file. A .dist file specifies distances (in BP) between SNPs. This is to correct for regions with low mappability in the genomic mask (A tool that takes a genomic mask and generates a .dist file can be found here).
  • Optional argument --memory option specifies the approximate memory allowance (in GB) used for storing the distance matrices in memory. Relate will exceed this amount, as it also stores trees and other information in memory; however the option allows to approximately control memory usage. Default is 5GB. Relate can become more efficient (in runtime and hard disk usage) with more memory, particularly for large sample sizes.

 PATH_TO_RELATE/bin/Relate \
                  --mode "MakeChunks" \
                  --haps data/example.haps \
                  --sample data/example.sample \
                  --map data/genetic_map.txt  
Output: temporary files

For loop

First, obtain the number of chunks created in the previous step. In bash, this can be done as follows

num_chunks=$(ls parameters_* | wc -l)

Then, apply steps 2 - 6 to each chunk

 for chunk_index in `seq 0 $(($num_chunks - 1))`;
                  do
                    #stages 2-6
                  done

This for loop can be parallelised without moving any files or changing directories.

2. Paint

  • Argument --chunk_index specifies the index [int] of a chunk.

 PATH_TO_RELATE/bin/Relate \
                  --mode Paint \
                  --chunk_index ${chunk_index} 
Output: temporary files

3. Build Topology

  • Argument --chunk_index specifies the index [int] of a chunk.
  • Optional arguments --first_section, --last_section specify the first and last section within the chunk, for which to infer tree topologies. With these options, we can parallelise this mode (see below).
  • Argument -o specifies the output filename without file extension.

 PATH_TO_RELATE/bin/Relate \
                  --mode BuildTopology \
                  --chunk_index ${chunk_index} \
                  -o ${output} 
Output: temporary files

Optional:
You can code another parallelisable for loop here. To do this, you need to specify options first_section and last_section. First obtain the number of sections for this chunk. In bash, this can be done as follows

num_sections=$(ls chunk_${chunk_index}/paint/*bin | wc -l)

You can now code a for loop,
 for section in `seq 0 $(($num_sections - 1))`;
                  do
                    PATH_TO_RELATE/bin/Relate \
                        --mode BuildTopology \
                        --first_section $section \
                        --last_section $section \
                        --chunk_index ${chunk_index} \
                        -o ${output} 
                  done
Here --first_section and --last_section are both set to $section, such that only trees in one section will be inferred at a time. You can also specify more than one section here.

This for loop can be parallelised without moving any files or changing directories.

4. Find equivalent branches in adjacent trees and propagate mutations

  • Argument --chunk_index specifies the index [int] of a chunk.
  • Argument -o specifies the output filename without file extension.

 PATH_TO_RELATE/bin/Relate \
                  --mode "FindEquivalentBranches" \
                  --chunk_index ${chunk_index} \
                  -o ${output} 
Output: temporary files

5. Estimate branch lengths

  • Argument -m specifies the mutation rate [dbl]
  • Argument -N specifies the effective population size [dbl]
  • Argument --chunk_index specifies the index ([integer]) of a chunk.
  • Optional arguments --first_section, --last_section specify the first and last section within the chunk, for which to infer branch lengths. With these options, we can parallelise this mode (similar to stage 3 of the algorithm).
  • Argument -o specifies the output filename without file extension.

 PATH_TO_RELATE/bin/Relate \
                  --mode "InferBranchLengths" \
                  -m $mu \
                  -N $Ne \
                  --chunk_index ${chunk_index} \
                  -o ${output} 
Output: temporary files

Optional:
Similarly to stage 3, you can code a for loop,
 for section in `seq 0 $(($num_sections - 1))`;
                  do
                    PATH_TO_RELATE/bin/Relate \
                        --mode "InferBranchLengths" \
                        -m $mu \
                        -N $Ne \
                        --first_section $section \
                        --last_section $section \
                        --chunk_index ${chunk_index} \
                        -o ${output} 
                  done
Here --first_section and --last_section are both set to $section, such that only trees in one section will be inferred at a time. You can also specify more than one section here. Optionally, you can specify the --coal option, with which we specify coalescence rates through time in form of a .coal file. See here for file format. Specifying this option will overwrite the --effectiveN option.

This for loop can be parallelised without moving any files or changing directories.

6. Combine files

  • Argument -N specifies the effective population size [dbl]
  • Argument --chunk_index specifies the index [int] of a chunk.
  • Argument -o specifies the output filename without file extension.

 PATH_TO_RELATE/bin/Relate \
                  --mode "CombineSections" \
                  -N $Ne \
                  --chunk_index ${chunk_index} \
                  -o ${output} 
Output: temporary files

7. Finalize output

  • Argument -o specifies the output filename without file extension.

 PATH_TO_RELATE/bin/Relate \
                  --mode "Finalize" \
                  -o ${output} 
Output: ${output}.anc, ${output}.mut