Input data file format
The input file format for Relate is the haps/sample file format, which is used as an output file format by SHAPEIT2 (see here).
Recombination rates are specified using a genomic map.
For some add-on modules of Relate, we also want to specify population labels. The .poplabels file format is identical to the sample file in the hap/legend/sample file format.
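As a quick sanity check before running Relate, the sketch below verifies that a .haps file and its .sample file agree on the number of haplotypes. It assumes the SHAPEIT2 conventions: each .haps line has five leading columns (chromosome, SNP ID, position, two alleles) followed by one 0/1 column per haplotype, and the .sample file has two header lines followed by one line per individual. It also assumes that all individuals are diploid and that both files are uncompressed. The function name check_haps_sample is illustrative and is not part of Relate.

```shell
# Hypothetical helper (not part of Relate): checks that every line of an
# uncompressed .haps file has 5 + 2*N columns, where N is the number of
# individuals in the .sample file (assumes all individuals are diploid).
check_haps_sample() {
  haps="$1"
  sample="$2"
  # The .sample file has two header lines; the remaining lines are individuals.
  n_ind=$(awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }' "$sample")
  expected=$((5 + 2 * n_ind))
  # Report the first line whose column count differs from the expectation.
  awk -v expected="$expected" '
    NF != expected {
      printf "line %d: %d columns, expected %d\n", NR, NF, expected
      bad = 1
      exit
    }
    END { if (!bad) print "OK"; exit bad + 0 }
  ' "$haps"
}

# Example usage:
# check_haps_sample example.haps example.sample
```

The function prints OK and returns 0 when all lines match, and otherwise prints the first offending line and returns 1. Gzipped inputs would need to be decompressed first, for example with gzip -dc.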
Convert to haps/sample file format
Convert from hap/legend/sample
- --haps
- Filename of .haps file.
- --sample
- Filename of .sample file.
- -i,--input
- Filename (without file extension) of the hap/legend/sample files. The input files can be gzipped; we look for files with extensions hap,hap.gz/legend,legend.gz/sample,sample.gz, with preference given to the non-gzipped file.
- --chr
- Optional: specifies a chromosome index [int] (used as the first column in the haps file). Default: 0.
PATH_TO_RELATE/bin/RelateFileFormats \
--mode ConvertFromHapLegendSample \
--haps example.haps \
--sample example.sample \
-i example \
--chr 1
Output: example.haps, example.sample.
Convert from vcf
- --haps
- Filename of .haps file.
- --sample
- Filename of .sample file.
- -i,--input
- Filename (without file extension) of the vcf file. The input file can be gzipped.
- --chr
- Optional: specifies a chromosome index [int] (used as the first column in the haps file). Default: 0.
PATH_TO_RELATE/bin/RelateFileFormats \
--mode ConvertFromVcf \
--haps example.haps \
--sample example.sample \
-i example
Output: example.haps, example.sample.
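Data sets are often split into one vcf file per chromosome. The sketch below wraps the command above in a loop, using only the options listed in this section. The wrapper name convert_vcf_chr, the DRY_RUN switch, and the file naming scheme example_chr1, example_chr2, ... are assumptions made for illustration; adapt them to your own files.

```shell
# Placeholder path, as in the commands above; set it to the Relate directory.
PATH_TO_RELATE="${PATH_TO_RELATE:-/path/to/relate}"

# Hypothetical wrapper (not part of Relate): converts <prefix>_chr<N>.vcf
# for one chromosome. With DRY_RUN=1 the command is printed, not executed.
convert_vcf_chr() {
  prefix="$1"
  chr="$2"
  # Build the command as the positional parameters of this function.
  set -- "$PATH_TO_RELATE/bin/RelateFileFormats" \
    --mode ConvertFromVcf \
    --haps "${prefix}_chr${chr}.haps" \
    --sample "${prefix}_chr${chr}.sample" \
    -i "${prefix}_chr${chr}" \
    --chr "$chr"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example: preview the commands for chromosomes 1 to 3.
DRY_RUN=1
for chr in 1 2 3; do
  convert_vcf_chr example "$chr"
done
```

Setting DRY_RUN=0 (or leaving it unset) runs the conversions instead of printing them.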
Convert from ms output format
The script removes any sites with multiple mutations.
If the input file contains more than one simulation, these get separated into "chromosomes".
Download the script.
Rscript ms2haps.R infile.ms example nsites
- infile.ms
- Input filename with file extension.
- example
- Output filename without file extension.
- nsites
- (Optional) Number of simulated sites. Default value is 1. Positions are multiplied by this value.
Preparing input files
The following code can be used to prepare the input data. We recommend applying all steps (if applicable) in the listed order.
All input files can be gzipped.
Prepare input files
See here for links to external resources, such as an ancestral genome.
- --haps
- Filename of .haps file.
- --sample
- Filename of .sample file.
- -o,--output
- Filename of output files without file extensions. If same as input filenames, input files will be overwritten.
- --ancestor
- Fasta file containing ancestral genome. This is case insensitive.
- --mask
- Optional: Fasta file of same length as the ancestral genome containing a genomic mask. Loci passing the mask are denoted by P, loci not passing the mask are denoted by N. This is case insensitive.
- --remove_ids
- Optional: File containing ids of samples that will be removed from the haps/sample files. One id per line.
- --poplabels
- Optional: File containing population labels of samples.
PATH_TO_RELATE/scripts/PrepareInputFiles/PrepareInputFiles.sh \
--haps example.haps \
--sample example.sample \
--ancestor ancestor.fa \
--mask mask.fa \
--remove_ids remove_ids.txt \
--poplabels example.poplabels \
-o example_input
Output: example_input.haps, example_input.sample, example_input.dist (if --mask is specified), example_input.poplabels (if --remove_ids and --poplabels are specified), example_input.annot (if --poplabels is specified).
Remove non-biallelic SNPs
PATH_TO_RELATE/bin/RelateFileFormats \
--mode RemoveNonBiallelicSNPs \
--haps example.haps \
-o example_biallelic
Output: example_biallelic.haps.
Determine ancestral allele and flip if necessary
PATH_TO_RELATE/bin/RelateFileFormats \
--mode FlipHapsUsingAncestor \
--haps example.haps \
--sample example.sample \
--ancestor ancestor.fa \
-o example_ancestral
Output: example_ancestral.haps.
Remove samples from data set
PATH_TO_RELATE/bin/RelateFileFormats \
--mode RemoveSamples \
--haps example.haps \
--sample example.sample \
--poplabels example.poplabels \
-i remove_ids.txt \
-o example_subset
Output: example_subset.haps, example_subset.sample, example_subset.poplabels (if --poplabels is specified).
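The file of ids to remove contains one sample id per line. One way to build it is from the population labels, as in the sketch below, which lists all samples belonging to one population. It assumes that the .poplabels file is whitespace-separated, has one header line, and stores the sample id in the first column and the population label in the second. The function name ids_in_population and the label POP1 are illustrative only.

```shell
# Hypothetical helper (not part of Relate): prints the ids of all samples
# whose population label (assumed to be column 2 of the .poplabels file)
# equals the second argument. The header line is skipped.
ids_in_population() {
  awk -v pop="$2" 'NR > 1 && $2 == pop { print $1 }' "$1"
}

# Example usage, writing one id per line:
# ids_in_population example.poplabels POP1 > remove_ids.txt
```

The resulting remove_ids.txt can then be passed to the RemoveSamples mode via -i, as in the command above.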
Filter SNPs using a genomic mask
PATH_TO_RELATE/bin/RelateFileFormats \
--mode FilterHapsUsingMask \
--haps example.haps \
--sample example.sample \
--mask genomic_mask.fa \
-o example_mask
Output: example_mask.haps, example_mask.dist.
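Before filtering, it can be useful to see how much of the genome a mask keeps. The sketch below counts passing (P) positions in an uncompressed mask fasta file, ignoring header lines and letter case, in line with the mask description above. The function name mask_summary is an assumption for illustration and is not part of Relate.

```shell
# Hypothetical helper (not part of Relate): counts passing (P) positions and
# the total number of positions in an uncompressed mask fasta file.
# Lines starting with ">" are treated as fasta headers and skipped.
mask_summary() {
  awk '
    !/^>/ {
      line = toupper($0)             # the mask is case insensitive
      total += length(line)
      pass += gsub(/P/, "", line)    # gsub returns the number of P characters
    }
    END { printf "passing=%d total=%d\n", pass, total }
  ' "$1"
}

# Example usage:
# mask_summary genomic_mask.fa
```

For a gzipped mask, the file would need to be decompressed first, for example by piping gzip -dc into a variant of this function that reads standard input.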
Generate SNP annotations
PATH_TO_RELATE/bin/RelateFileFormats \
--mode GenerateSNPAnnotations \
--haps example.haps \
--sample example.sample \
--poplabels example.poplabels \
--ancestor ancestor.fa \
-o example
Output: example.annot. If --mut is specified, columns are appended to example.mut.
Convert output to other formats
The following code can be used to convert the output to other file formats.
Convert to tree sequence file format (tskit)
This function converts anc/mut files inferred by Relate into the tree sequence file format used by tskit. In the current implementation, each tree is stored with new nodes in the tree sequence file format, so the output is not compressed. In addition, information about how long branches persist and how many mutations map to a branch is lost in this conversion.
- -i,--input
- Filename of .anc and .mut files without file extension.
- -o,--output
- Filename of the output tree sequence file without file extensions.
PATH_TO_RELATE/bin/RelateFileFormats \
--mode ConvertToTreeSequence \
-i example \
-o example
Output: example.trees.
Extract tree at a SNP of interest in newick format
Extracts a tree at a SNP of interest in newick format.
- --anc
- Filename of the .anc file.
- --mut
- Filename of the .mut file.
- --bp_of_interest
- Integer specifying the BP position of interest.
- -o,--output
- Filename of the output files without file extensions.
PATH_TO_RELATE/bin/RelateExtract \
--mode TreeAtSNPAsNewick \
--anc example.anc \
--mut example.mut \
--bp_of_interest 3000000 \
-o example
Output: example_at_.newick.
External resources
Below is a list of external resources that can be useful for Relate-based analyses. These resources were compiled by third parties, and their respective conditions of use (such as citing the corresponding publications) apply.
Humans
Here is a dropbox link to hg37 and hg38 ancestral genomes, recombination maps, and mappability masks.
External links: