Parallelise Relate
Relate has been coded such that it can be parallelised on hundreds of cores on large computing clusters. We provide a script for running Relate on an SGE cluster here. For smaller datasets, where it is sufficient to parallelise over only a few cores, we provide a bash script that parallelises Relate without having to submit many jobs.
Parallelised Relate
We provide a bash script that parallelises Relate on multiple cores. The script can be found under
PATH_TO_RELATE/scripts/RelateParallel/RelateParallel.sh
The script takes the same arguments (and same optional arguments) as Relate, and one additional argument:
- --threads
- Maximum number of threads. Default is 1. Note that the script only parallelises Relate if the input data set is sufficiently large. Otherwise, it is more efficient to run Relate on a single core, and Relate will automatically do so. If you set --max_memory to a smaller value, Relate will tend to parallelise over more cores.
PATH_TO_RELATE/scripts/RelateParallel/RelateParallel.sh \
-m 1.25e-8 \
-N 30000 \
--haps data/example.haps \
--sample data/example.sample \
--map data/genetic_map.txt \
--annot data/example.annot \
--seed 1 \
-o example \
--threads 8
Output: example.anc, example.mut
Relate on an SGE cluster
We provide a bash script for running Relate on a Sun Grid Engine (SGE) cluster. The script can be found under
PATH_TO_RELATE/scripts/RelateSGE/RelateSGE.sh
The script takes the same arguments (and same optional arguments) as Relate, and some additional arguments that can be directly mapped to the corresponding SGE arguments:
- -P
- Assign job to the project.
- -q
- Queue.
- --pe
- Optional. Processor count per node. Default is 'shmem 1'.
PATH_TO_RELATE/scripts/RelateSGE/RelateSGE.sh \
-m 1.25e-8 \
-N 30000 \
--haps data/example.haps \
--sample data/example.sample \
--map data/genetic_map.txt \
--annot data/example.annot \
--seed 1 \
-o example \
-P myers.prjc \
-q short.qc
Output: example.anc, example.mut
Stages of Relate
To parallelise Relate, do not use mode All. Instead, execute each of the 7 stages in sequence:
- MakeChunks: Divides the chromosome into chunks. To each chunk, we apply stages 2-6.
- Paint: Applies the chromosome painting algorithm and stores distance matrices as temporary files.
- BuildTopology: Uses distance matrices calculated in the previous stage and estimates tree topologies.
- FindEquivalentBranches: Looks for equivalent branches in adjacent trees.
- InferBranchLengths: Estimates branch lengths for the trees.
- CombineSections: Combines files for the chunk.
- Finalize: Combines all chunks and generates the output.
In the code below, we assume the following variables have been set:
- output [string]: Filename of output without file extension.
- mu [dbl]: Mutation rate.
- Ne [dbl]: Effective population size of haplotypes. (NOT of individuals! To get the population size of haplotypes, multiply the effective population size of individuals by 2.)
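For example, in bash these could be set as follows (the values are placeholders, chosen to match the examples above):
output="example"   # filename of output without file extension
mu=1.25e-8         # mutation rate
Ne=30000           # effective population size of haplotypes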
1. Make chunks
- Arguments --haps, --sample, and --map take the filenames of the input data.
- Optional argument --dist takes the filename of a .dist file. A .dist file specifies distances (in BP) between SNPs. This is to correct for regions with low mappability in the genomic mask (A tool that takes a genomic mask and generates a .dist file can be found here).
- Optional argument --memory specifies the approximate memory allowance (in GB) used for storing the distance matrices in memory. Relate will exceed this amount, as it also stores trees and other information in memory; however, the option allows you to approximately control memory usage. Default is 5GB. Relate can become more efficient (in runtime and hard disk usage) with more memory, particularly for large sample sizes.
PATH_TO_RELATE/bin/Relate \
--mode "MakeChunks" \
--haps data/example.haps \
--sample data/example.sample \
--map data/genetic_map.txt \
-o ${output}
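If you also want to use the optional --dist and --memory arguments described above, the command might look like this (data/example.dist is a placeholder filename and 10GB is an arbitrary memory allowance):
PATH_TO_RELATE/bin/Relate \
--mode "MakeChunks" \
--haps data/example.haps \
--sample data/example.sample \
--map data/genetic_map.txt \
--dist data/example.dist \
--memory 10 \
-o ${output}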
Output: temporary files
For loop
First, obtain the number of chunks (this stores the number of chunks in the variable num_chunks):
num_chunks=$(ls parameters_* | wc -l)
Then, apply stages 2-6 to each chunk:
for chunk_index in `seq 0 $(($num_chunks - 1))`;
do
#stages 2-6
done
This for loop can be parallelised without moving any files or changing directories.
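For instance, one simple way to parallelise this loop in bash is to run each chunk as a background job and wait for all of them to finish. This is a minimal sketch; in practice you may want to limit the number of concurrent jobs to the number of available cores.
for chunk_index in `seq 0 $(($num_chunks - 1))`;
do
(
#stages 2-6 for this chunk
) &   # run each chunk in a subshell in the background
done
wait    # wait for all chunks to finish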
2. Paint
- Argument --chunk_index specifies the index [int] of a chunk.
PATH_TO_RELATE/bin/Relate \
--mode Paint \
--chunk_index ${chunk_index} \
-o ${output}
Output: temporary files
3. Build Topology
- Argument --chunk_index specifies the index [int] of a chunk.
- Optional arguments --first_section, --last_section specify the first and last section within the chunk, for which to infer tree topologies. With these options, we can parallelise this mode (see below).
- Argument -o specifies the output filename without file extension.
PATH_TO_RELATE/bin/Relate \
--mode BuildTopology \
--chunk_index ${chunk_index} \
-o ${output}
Output: temporary files
Optional:
You can code another parallelisable for loop here. To do this, you need to specify the options --first_section and --last_section. First, obtain the number of sections for this chunk. In bash, this can be done as follows:
num_sections=$(ls chunk_${chunk_index}/paint/*bin | wc -l)
You can now code a for loop,
for section in `seq 0 $(($num_sections - 1))`;
do
PATH_TO_RELATE/bin/Relate \
--mode BuildTopology \
--first_section $section \
--last_section $section \
--chunk_index ${chunk_index} \
-o ${output}
done
Here --first_section and --last_section are both set to $section, such that only trees in one section will be inferred at a time. You can also specify more than one section here.
This for loop can be parallelised without moving any files or changing directories.
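As an illustration of specifying more than one section per job, the following sketch infers tree topologies in batches of 10 sections. The batch size is arbitrary, and we assume both bounds are inclusive, as suggested by the single-section loop above.
batch_size=10
for first_section in `seq 0 $batch_size $(($num_sections - 1))`;
do
last_section=$(($first_section + $batch_size - 1))
if [ $last_section -ge $num_sections ]; then last_section=$(($num_sections - 1)); fi
PATH_TO_RELATE/bin/Relate \
--mode BuildTopology \
--first_section $first_section \
--last_section $last_section \
--chunk_index ${chunk_index} \
-o ${output}
done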
4. Find equivalent branches in adjacent trees and propagate mutations
- Argument --chunk_index specifies the index [int] of a chunk.
- Argument -o specifies the output filename without file extension.
PATH_TO_RELATE/bin/Relate \
--mode "FindEquivalentBranches" \
--chunk_index ${chunk_index} \
-o ${output}
Output: temporary files
5. Estimate branch lengths
- Argument -m specifies the mutation rate [dbl].
- Argument -N specifies the effective population size [dbl].
- Argument --chunk_index specifies the index [int] of a chunk.
- Optional arguments --first_section, --last_section specify the first and last section within the chunk, for which to infer branch lengths. With these options, we can parallelise this mode (similar to stage 3 of the algorithm).
- Argument -o specifies the output filename without file extension.
PATH_TO_RELATE/bin/Relate \
--mode "InferBranchLengths" \
-m $mu \
-N $Ne \
--chunk_index ${chunk_index} \
-o ${output}
Output: temporary files
Optional:
Similarly to stage 3, you can code a for loop,
for section in `seq 0 $(($num_sections - 1))`;
do
PATH_TO_RELATE/bin/Relate \
--mode "InferBranchLengths" \
-m $mu \
-N $Ne \
--first_section $section \
--last_section $section \
--chunk_index ${chunk_index} \
-o ${output}
done
Here --first_section and --last_section are both set to $section, such that only trees in one section will be inferred at a time. You can also specify more than one section here. Optionally, you can use the --coal option to specify coalescence rates through time in the form of a .coal file (see here for the file format); specifying --coal overrides the --effectiveN (-N) option. A sketch using --coal is shown below.
This for loop can be parallelised without moving any files or changing directories.
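For example, a run of this stage using a .coal file instead of a constant effective population size might look as follows (data/example.coal is a placeholder filename):
PATH_TO_RELATE/bin/Relate \
--mode "InferBranchLengths" \
-m $mu \
--coal data/example.coal \
--chunk_index ${chunk_index} \
-o ${output}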
6. Combine files
- Argument -N specifies the effective population size [dbl].
- Argument --chunk_index specifies the index [int] of a chunk.
- Argument -o specifies the output filename without file extension.
PATH_TO_RELATE/bin/Relate \
--mode "CombineSections" \
-N $Ne \
--chunk_index ${chunk_index} \
-o ${output}
Output: temporary files
7. Finalize output
- Argument -o specifies the output filename without file extension.
PATH_TO_RELATE/bin/Relate \
--mode "Finalize" \
-o ${output}
Output: ${output}.anc, ${output}.mut
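Putting the stages together, a minimal sequential sketch of the full pipeline (without any parallelisation, and using the same placeholder values as the examples above) might look like this:
#!/bin/bash
# Placeholder values; substitute your own.
output="example"
mu=1.25e-8
Ne=30000

# Stage 1: divide the chromosome into chunks
PATH_TO_RELATE/bin/Relate --mode "MakeChunks" \
--haps data/example.haps --sample data/example.sample \
--map data/genetic_map.txt -o ${output}

# Stages 2-6: applied to each chunk (this loop can be parallelised)
num_chunks=$(ls parameters_* | wc -l)
for chunk_index in `seq 0 $(($num_chunks - 1))`;
do
PATH_TO_RELATE/bin/Relate --mode "Paint" --chunk_index ${chunk_index} -o ${output}
PATH_TO_RELATE/bin/Relate --mode "BuildTopology" --chunk_index ${chunk_index} -o ${output}
PATH_TO_RELATE/bin/Relate --mode "FindEquivalentBranches" --chunk_index ${chunk_index} -o ${output}
PATH_TO_RELATE/bin/Relate --mode "InferBranchLengths" -m $mu -N $Ne --chunk_index ${chunk_index} -o ${output}
PATH_TO_RELATE/bin/Relate --mode "CombineSections" -N $Ne --chunk_index ${chunk_index} -o ${output}
done

# Stage 7: combine all chunks into ${output}.anc and ${output}.mut
PATH_TO_RELATE/bin/Relate --mode "Finalize" -o ${output}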