Input data file format

The input file format for Relate is the haps/sample file format, which is used as an output file format by SHAPEIT2 (see here).

.haps The first five columns are:
  1. Chromosome number [integer]
  2. SNP ID [string]
  3. SNP position [integer]
  4. Ancestral allele [char]
  5. Alternative allele [char]
Then, the successive column pairs correspond to the haplotypes carried by an individual at that SNP.
Please make sure that the ancestral allele is represented by 0 and the alternative allele is represented by 1.
7 SNP1 123 A G 0 0 1 0 0 0 1 1
7 SNP2 456 T C 0 1 1 0 0 1 0 1
7 SNP3 789 A T 0 1 1 0 1 1 1 1
This file is space delimited.
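
As an illustration of the layout described above, the following minimal Python sketch parses one line of a .haps file. The names HapsRecord and parse_haps_line are hypothetical helpers introduced here for illustration; they are not part of Relate.

```python
from typing import List, NamedTuple


class HapsRecord(NamedTuple):
    chrom: int
    snp_id: str
    pos: int
    ancestral: str
    alternative: str
    haplotypes: List[int]  # one 0/1 entry per haplotype


def parse_haps_line(line: str) -> HapsRecord:
    """Split a space-delimited .haps line into its five leading columns and the haplotypes."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected five leading columns plus at least one haplotype")
    chrom, snp_id, pos, anc, alt = fields[:5]
    haps = [int(x) for x in fields[5:]]
    # 0 denotes the ancestral allele, 1 the alternative allele.
    if any(h not in (0, 1) for h in haps):
        raise ValueError("haplotype entries must be 0 or 1")
    return HapsRecord(int(chrom), snp_id, int(pos), anc, alt, haps)


record = parse_haps_line("7 SNP1 123 A G 0 0 1 0 0 0 1 1")
print(record.pos, record.haplotypes)
```

For diploid data, the eight haplotype columns in this example correspond to four individuals.
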
.sample The three columns are:
  1. First individual ID
  2. Second individual ID
  3. Missing data proportion
Diploid organisms:
ID_1 ID_2 missing
0 0 0
UNR1 UNR1 0
UNR2 UNR2 0
UNR3 UNR3 0
UNR4 UNR4 0
Haploid organisms:
ID_1 ID_2 missing
0 0 0
UNR1 NA 0
UNR2 NA 0
UNR3 NA 0
UNR4 NA 0
This file is space delimited.
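
The two examples above differ only in the second column. The sketch below reads a .sample file and reports how many haplotype columns each individual is expected to occupy in the .haps file. It assumes, based on the examples above, that NA in the second column marks a haploid individual; haplotypes_per_sample is a hypothetical helper, not part of Relate.

```python
def haplotypes_per_sample(sample_text: str) -> dict:
    """Map each individual ID to the number of haplotypes it carries (1 or 2)."""
    rows = [line.split() for line in sample_text.strip().splitlines()]
    counts = {}
    # The first two rows ("ID_1 ID_2 missing" and "0 0 0") are header rows.
    for id_1, id_2, _missing in rows[2:]:
        # Assumption: NA in the second column denotes a haploid individual.
        counts[id_1] = 1 if id_2 == "NA" else 2
    return counts


diploid = "ID_1 ID_2 missing\n0 0 0\nUNR1 UNR1 0\nUNR2 UNR2 0\n"
haploid = "ID_1 ID_2 missing\n0 0 0\nUNR1 NA 0\nUNR2 NA 0\n"
diploid_counts = haplotypes_per_sample(diploid)
haploid_counts = haplotypes_per_sample(haploid)
print(diploid_counts, haploid_counts)
```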

Recombination rates are specified using a genetic map.

.map This is the standard file format for genetic recombination maps. The three columns are:
  1. Position (bp) [integer]
  2. Recombination rate (cM/Mb) [float]
  3. Genetic position (cM) [float]
Denoting the ith entries of the three columns by p[i], r[i], and rdist[i], the following equation holds:
r[i] = (rdist[i+1] - rdist[i])/(p[i+1] - p[i]) * 1e6
pos COMBINED_rate Genetic_Map
0 2.8074 0.4103
2529 2.7778 0.4174
2601 2.9813 0.4176
This file is space delimited.
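
The relationship between the three columns can be checked numerically. The sketch below applies the equation above to the three example rows; rates_from_map is a hypothetical helper introduced for illustration.

```python
def rates_from_map(positions, genetic_positions):
    """Recombination rate (cM/Mb) between consecutive map entries."""
    return [
        (genetic_positions[i + 1] - genetic_positions[i])
        / (positions[i + 1] - positions[i])
        * 1e6
        for i in range(len(positions) - 1)
    ]


p = [0, 2529, 2601]                # first column of the example
rdist = [0.4103, 0.4174, 0.4176]   # third column of the example
rates = rates_from_map(p, rdist)
print([round(r, 4) for r in rates])  # approximately [2.8074, 2.7778]
```

The two values agree with the second column of the first two example rows. The rate in the last row cannot be checked this way, because it depends on the next map entry.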

For some add-on modules of Relate, we also want to specify population labels. The .poplabels file format is identical to the sample file in the hap/legend/sample file format.

.poplabels The four columns are:
  1. Individual ID as specified in the .sample file [string]
  2. Population label [string]
  3. Group label [string]
  4. Sex [integer]
Diploid organisms:
sample population group sex
UNR1 PJB SAS NA
UNR2 JPT EAS NA
UNR3 GBR EUR NA
UNR4 YRI AFR NA
Haploid organisms:
sample population group sex
UNR1 PJB SAS 1
UNR2 JPT EAS 1
UNR3 GBR EUR 1
UNR4 YRI AFR 1
For haploid organisms (or haplotype-level annotation in diploid organisms), set the sex column to 1.
This file is space delimited.
Samples must be listed in the same order as in the .sample file above.
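
The ordering requirement can be checked before running any add-on module. The sketch below compares the IDs of a .sample file with those of a .poplabels file. It assumes one row per individual in both files, as in the examples above; poplabels_match_samples is a hypothetical helper, not part of Relate.

```python
def poplabels_match_samples(sample_text: str, poplabels_text: str) -> bool:
    """Return True if both files list the same IDs in the same order."""
    # .sample has two header rows, .poplabels has one.
    sample_ids = [line.split()[0] for line in sample_text.strip().splitlines()[2:]]
    poplabel_ids = [line.split()[0] for line in poplabels_text.strip().splitlines()[1:]]
    return sample_ids == poplabel_ids


sample_text = "ID_1 ID_2 missing\n0 0 0\nUNR1 UNR1 0\nUNR2 UNR2 0\n"
good = "sample population group sex\nUNR1 PJB SAS NA\nUNR2 JPT EAS NA\n"
bad = "sample population group sex\nUNR2 JPT EAS NA\nUNR1 PJB SAS NA\n"
print(poplabels_match_samples(sample_text, good))
print(poplabels_match_samples(sample_text, bad))
```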

Convert to haps/sample file format


Convert from hap/legend/sample

Converts from the hap/legend/sample file format (used for reference panels in e.g., IMPUTE2, see here) to the haps/sample file format. The code deletes non-biallelic SNPs.
--haps
Filename of .haps file.
--sample
Filename of .sample file.
-i,--input
Filename (without file extension) of the hap/legend/sample files. The input files can be gzipped; we look for files with extensions hap,hap.gz/legend,legend.gz/sample,sample.gz, with preference given to the non-gzipped file.
--chr
Optional: specifies a chromosome index [int] (used as the first column in the haps file). Default: 0.
 PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode ConvertFromHapLegendSample \
                 --haps example.haps \
                 --sample example.sample \
                 -i example \
                 --chr 1  
Output: example.haps, example.sample.
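
The lookup rule for input files described above (non-gzipped file preferred over the gzipped one) can be sketched as follows. This is an illustration of the stated rule only, not the code used by RelateFileFormats; resolve_input and its exists parameter are hypothetical.

```python
import os


def resolve_input(prefix: str, extension: str, exists=os.path.exists) -> str:
    """Return prefix.extension if present, otherwise prefix.extension.gz."""
    plain = f"{prefix}.{extension}"
    gzipped = plain + ".gz"
    if exists(plain):       # the non-gzipped file takes precedence
        return plain
    if exists(gzipped):
        return gzipped
    raise FileNotFoundError(f"neither {plain} nor {gzipped} was found")


# Example with a stand-in for the file system, so nothing needs to exist on disk.
available = {"example.hap", "example.hap.gz", "example.legend.gz"}
hap_path = resolve_input("example", "hap", exists=lambda path: path in available)
legend_path = resolve_input("example", "legend", exists=lambda path: path in available)
print(hap_path, legend_path)
```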

Convert from vcf

Converts from the vcf file format to the haps/sample file format.
--haps
Filename of .haps file.
--sample
Filename of .sample file.
-i,--input
Filename (without file extension) of the vcf file. The input file can be gzipped, with preference given to the non-gzipped file.
--chr
Optional: specifies a chromosome index [int] (used as the first column in the haps file). Default: 0.
 PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode ConvertFromVcf \
                 --haps example.haps \
                 --sample example.sample \
                 -i example 
Output: example.haps, example.sample.
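
To make the correspondence between the two formats concrete, the sketch below turns one phased, diploid VCF data line into a .haps line. It is a simplified illustration, not the conversion implemented in RelateFileFormats. It assumes tab-separated standard VCF columns, phased genotypes written with "|", and that REF and ALT are placed in the fourth and fifth columns; vcf_record_to_haps_line is a hypothetical helper.

```python
def vcf_record_to_haps_line(vcf_line: str):
    """Convert one phased diploid VCF data line to a .haps line, or None if not a biallelic SNP."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, snp_id, ref, alt = fields[:5]
    # Keep only biallelic single-nucleotide variants.
    if len(ref) != 1 or len(alt) != 1:
        return None
    gt_index = fields[8].split(":").index("GT")  # position of GT in the FORMAT column
    alleles = []
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        if "|" not in genotype:
            raise ValueError("genotypes must be phased")
        alleles.extend(genotype.split("|"))
    if any(a not in ("0", "1") for a in alleles):
        raise ValueError("missing or non-biallelic genotype")
    return " ".join([chrom, snp_id, pos, ref, alt] + alleles)


line = "7\t123\tSNP1\tA\tG\t.\tPASS\t.\tGT\t0|0\t1|0"
haps_line = vcf_record_to_haps_line(line)
print(haps_line)
```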

Convert from ms output format

R script that converts from the output file format used by ms to the haps/sample file format.
The script removes any sites with multiple mutations.
If the input file contains more than one simulation, these get separated into "chromosomes".

Download the script.

 Rscript ms2haps.R infile.ms example nsites
infile.ms
Input filename with file extension.
example
Output filename without file extension.
nsites
(Optional) Number of simulated sites. Default value is 1. Positions are multiplied by this value.
Output: example.haps, example.sample.

Preparing input files


The following code can be used to prepare the input data. We recommend applying all steps (if applicable) in the listed order.

All input files can be gzipped.

Prepare input files

Prepares input files for Relate. The script will remove non-biallelic SNPs, flip haplotypes according to the ancestral genome such that the ancestral allele is always denoted by 0, remove samples if specified, filter SNPs and adjust distances between SNPs using a genomic mask if specified, and generate SNP annotations needed for some add-on modules.
See here for links to external resources, such as an ancestral genome.
--haps
Filename of .haps file.
--sample
Filename of .sample file.
-o,--output
Filename of output files without file extensions. If same as input filenames, input files will be overwritten.
--ancestor
Fasta file containing ancestral genome. This is case insensitive.
--mask
Optional: Fasta file of same length as the ancestral genome containing a genomic mask. Loci passing the mask are denoted by P, loci not passing the mask are denoted by N. This is case insensitive.
--remove_ids
Optional: File containing ids of samples that will be removed from the haps/sample files. One id per line.
--poplabels
Optional: File containing population labels of samples.
 PATH_TO_RELATE/scripts/PrepareInputFiles/PrepareInputFiles.sh \
                 --haps example.haps \
                 --sample example.sample \
                 --ancestor ancestor.fa \
                 --mask mask.fa \
                 --remove_ids remove_ids.txt \
                 --poplabels example.poplabels \
                 -o example_input 
Output: example_input.haps, example_input.sample, example_input.dist (if --mask is specified), example_input.poplabels (if --remove_ids and --poplabels are specified), example_input.annot (if --poplabels is specified).

Remove non-biallelic SNPs

Removes SNPs that are not biallelic in the data set.
 PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode RemoveNonBiallelicSNPs \
                 --haps example.haps \
                 -o example_biallelic 
Output: example_biallelic.haps.

Determine ancestral allele and flip if necessary

The ancestral genome should be saved as a fasta file and aligned to the data set. The code is insensitive to case in the fasta file and deletes SNPs at which neither allele matches the ancestral allele.
 PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode FlipHapsUsingAncestor \
                 --haps example.haps \
                 --sample example.sample \
                 --ancestor ancestor.fa \
                 -o example_ancestral 
Output: example_ancestral.haps
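
The flipping rule described above can be sketched for a single SNP as follows. This is an illustration of the stated behaviour, not the code used by RelateFileFormats; flip_to_ancestral is a hypothetical helper that takes the split fields of one .haps line and the base of the ancestral genome at that position.

```python
def flip_to_ancestral(fields, ancestral_base):
    """Return .haps fields with the ancestral allele coded as 0, or None if the SNP is dropped."""
    chrom, snp_id, pos, allele0, allele1 = fields[:5]
    haps = fields[5:]
    anc = ancestral_base.upper()  # the fasta file is treated as case insensitive
    if anc == allele0.upper():
        return list(fields)  # already polarised correctly
    if anc == allele1.upper():
        # Swap the allele columns and flip every haplotype entry.
        flipped = ["1" if h == "0" else "0" for h in haps]
        return [chrom, snp_id, pos, allele1, allele0] + flipped
    return None  # neither allele matches the ancestral base: SNP is deleted


snp = "7 SNP1 123 A G 0 0 1 0".split()
print(flip_to_ancestral(snp, "g"))
print(flip_to_ancestral(snp, "N"))
```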

Remove samples from data set

Write the ids of individuals, as specified in the .sample file, into remove_ids.txt (one id per line). These individuals will be removed from the data set. Argument --poplabels is optional. If specified, a new poplabels file is generated for the subsetted samples.
 PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode RemoveSamples \
                 --haps example.haps \
                 --sample example.sample \
                 --poplabels example.poplabels \
                 -i remove_ids.txt \
                 -o example_subset 
Output: example_subset.haps, example_subset.sample, example_subset.poplabels (if --poplabels is specified).
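
The effect of removing samples on a .haps line can be sketched as follows. The sketch assumes diploid data, so that individual i occupies the two haplotype columns starting at index 5 + 2*i; remove_samples is a hypothetical helper, not part of Relate.

```python
def remove_samples(sample_ids, haps_fields, remove_ids):
    """Drop the haplotype column pairs of the listed individuals (diploid data assumed)."""
    keep = [i for i, sample_id in enumerate(sample_ids) if sample_id not in remove_ids]
    kept_ids = [sample_ids[i] for i in keep]
    new_fields = list(haps_fields[:5])  # the five leading columns are unchanged
    for i in keep:
        new_fields.extend(haps_fields[5 + 2 * i : 7 + 2 * i])
    return kept_ids, new_fields


ids = ["UNR1", "UNR2", "UNR3", "UNR4"]
snp = "7 SNP1 123 A G 0 0 1 0 0 0 1 1".split()
kept_ids, subset = remove_samples(ids, snp, {"UNR2"})
print(kept_ids, " ".join(subset))
```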

Filter SNPs using a genomic mask

The genomic mask needs to be specified as a fasta file of same length as the ancestral genome, such that the xth entry is the xth base of the genome. The passing state should be denoted as P. The code is insensitive to case in the fasta file and deletes all SNPs that are not passing. It also outputs a .dist file that contains the distances between SNPs, adjusted for regions that are not passing in the mask.
 PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode FilterHapsUsingMask \
                 --haps example.haps \
                 --sample example.sample \
                 --mask genomic_mask.fa \
                 -o example_mask 
Output: example_mask.haps, example_mask.dist.
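
The filtering step can be sketched as follows. The first function keeps SNPs at passing positions, assuming 1-based SNP positions so that the xth mask entry corresponds to position x. The second function counts the passing bases between consecutive SNPs; this is only one possible reading of "adjusted distances" and is an assumption, not a description of the .dist file written by Relate. Both functions are hypothetical helpers.

```python
def filter_by_mask(positions, mask):
    """Keep SNP positions (1-based) whose mask entry is P, ignoring case."""
    return [p for p in positions if mask[p - 1].upper() == "P"]


def passing_bases_between(positions, mask):
    """Count passing bases in the interval (positions[i], positions[i+1]] for each i.

    Assumption: illustrative definition only, not Relate's .dist format.
    """
    return [
        sum(1 for x in range(a + 1, b + 1) if mask[x - 1].upper() == "P")
        for a, b in zip(positions, positions[1:])
    ]


mask = "PPnnPPPNPP"          # toy mask for a genome of length 10
snps = [2, 4, 6, 10]         # 1-based SNP positions
kept = filter_by_mask(snps, mask)
distances = passing_bases_between(kept, mask)
print(kept, distances)
```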

Generate SNP annotations

This code uses the ancestral genome and the population labels of samples to append additional information to the output of Relate. The file format for population labels is described here. The ancestral genome should be saved in fasta file format and aligned to the data set. Argument --ancestor is optional. If argument --mut is specified, the output is appended directly to the .mut file.
 PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode GenerateSNPAnnotations \
                 --haps example.haps \
                 --sample example.sample \
                 --poplabels example.poplabels \
                 --ancestor ancestor.fa \
                 -o example 
Output: example.annot. If --mut is specified, columns are appended to example.mut.

Convert output to other formats


The following code can be used to convert the output to other file formats.

Convert to tree sequence file format (tskit)

This function converts anc/mut files inferred by Relate into the tree sequence file format used by tskit. In the current implementation, each tree is stored with new nodes in the tree sequence file format, so there is no compression. In addition, information about how long branches persist and how many mutations map to a branch is lost in this conversion.

-i,--input
Filename of .anc and .mut files without file extension.
-o,--output
Filename of the output tree sequence file without file extensions.
 PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode ConvertToTreeSequence \
                 -i example \
                 -o example
                    
Output: example.trees.

Extract tree at a SNP of interest in newick format

Extracts a tree at a SNP of interest in newick format.

--anc
Filename of the .anc file.
--mut
Filename of the .mut file.
--bp_of_interest
Integer specifying the BP position of interest.
-o,--output
Filename of the output files without file extensions.
 PATH_TO_RELATE/bin/RelateExtract \
                 --mode TreeAtSNPAsNewick \
                 --anc example.anc \
                 --mut example.mut \
                 --bp_of_interest 3000000 \
                 -o example 
Output: example_at_.newick.

External resources


Below is a list of external resources that can be useful for Relate-based analyses. These resources were compiled by third parties, and their respective conditions of use (such as citing the corresponding publications) apply.

Humans

Here is a dropbox link to hg37 and hg38 ancestral genomes, recombination maps, and mappability masks.

External links: