Preparing 1KGP dataset for merge with HapMap

Here is description on how 1kgp.bed, 1kgp.bim, and 1kgp.fam plink files were created. We will not reproduce these commands due to time constrans, but the curious ones can take a look here to see how 1KGP plink files were prepared.

Assumption is that you have access to the "raw" 1KGP (described bellow how to download) and HapMap files (hapmap_0 plink files present exercise/data directory i.e. before any QC was performed on them). By using these files we can maximize the number of overlapping SNPs between two datasets, which can be useful for constructing PCA or MDS. The 1KGP plink files that we use in tutorial (1kgp plink files preset in exercise/data) contain only SNPs found in HapMap plink files, and non-overlapping individuals between 1KGP and HapMap (81 individuals out of 629 from 1KGP present in HapMap were removed).


Download 1KGP data

To download the 1000 Genomes Project (1KGP) vcf file containing 629 individuals from different populations one can use the ftp link from an official International Genome Sample Resource (IGSR) webpage. The 1KGP vcf file is quite large (>60 gigabyte), and for the sake of the time and memory we will not run those command. Commands presented here are only for the reference on how the files we are using were created.

cd ~/gwas_exercises/data

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz

cd ~/gwas_exercises

SNP calls based on 628 individuals from the 2010-08-04 sequence and alignment release of the 1000 Genomes Project is based on the GRCh37 assembly of the human genome, and are released in the format VCF 4.0 (IGSR November 2010 Data Release).

Once the data is downloaded (e.g. in gwas_exercises/data directory), we can prepare a list of SNPs present in HapMap plink files (i.e. hapmap_0.bim)

awk '{print $2}' data/hapmap_0.bim > out/hapmap_0.SNPs

And then we can extract SNPs present in unfilltered HapMap dataset from 1KGP.

plink --bfile data/ALL.2of4intersection.20100804.genotypes --extract out/hapmap_0.SNPs --make-bed --out out/1kgp_0

Change 1KGP build

The datasets must have the same build. We can change the 1KGP build to match HapMap. - HapMap: NCBI build 36 / UCSC hg18 - 1KGP: GRCh37 / UCSC hg19

The dbSNP Reference SNP (rs or RefSNP) do not depend on a reference genome. They point to a specified locus regardless of the differences in genomic assemblies. So we will just change the location of the SNPs based on their rsID.

Update chromosome and physical position of SNPs on the chromosome in 1KGP

awk '{print $2, $1}' data/hapmap_0.bim > out/hapmap_0.snp_chr
awk '{print $2, $4}' data/hapmap_0.bim > out/hapmap_0.snp_map
plink --bfile out/1kgp_0 --update-chr out/hapmap_0.snp_chr --update-map out/hapmap_0.snp_map --make-bed --out out/1kgp_1

Use plink2 to sort variants

plink2 --bfile out/1kgp_1 --sort-vars --make-pgen --out out/1kgp_2tmp
plink2 --pfile out/1kgp_2tmp --make-bed --out out/1kgp_2
rm out/1kgp_2tmp*

Set reference genome of 1kgp to match HapMap

awk '{print $2, $5}' data/hapmap_0.bim > out/hapmap_0.ref_list
plink --bfile out/1kgp_2 --reference-allele out/hapmap_0.ref_list --make-bed --out out/1kgp_3

Check for potential strand issues and resolve them

awk '{print $2, $5, $6}' out/1kgp_3.bim > out/1kgp_3.snp_strands
awk '{print $2, $5, $6}' data/hapmap_0.bim > out/hapmap_0.snp_strands
sort out/1kgp_3.snp_strands out/hapmap_0.snp_strands | uniq -u > out/all_differences.txt

Flip SNPs for resolving strand issues

awk '{print $1}' out/all_differences.txt | sort -u > out/flip_list.txt
plink --bfile out/1kgp_3 --flip out/flip_list.txt --reference-allele out/hapmap_0.ref_list --make-bed --out out/1kgp_4

Check for SNPs which are still problematic after they have been flipped

awk '{print $2 , $5, $6}' out/1kgp_4.bim > out/1kgp_4.snp_strans
sort out/1kgp_4.snp_strans out/hapmap_0.snp_strands | uniq -u  > out/uncorresponding_SNPs.txt

Remove problematic SNPs from 1KGP

awk '{print $1}' out/uncorresponding_SNPs.txt | sort -u > out/SNPs_for_exlusion.txt
plink --bfile out/1kgp_4 --exclude out/SNPs_for_exlusion.txt --make-bed --out out/1kgp_5

Update sex

File data/igsr_samples.tsv containing metadata for 1KGP samples was downloaded from https://www.internationalgenome.org/data-portal/sample

awk '{print $2}' out/1kgp_5.fam > out/1kgp_5.ind

grep -f out/1kgp_5.ind data/igsr_samples.tsv | awk 'FS="\t" {print $1, $1, $2, $4, $6}' > out/1kgp_5.metadata

sed -i 's/female/2/g' out/1kgp_5.metadata 
sed -i 's/male/1/g' out/1kgp_5.metadata

plink --bfile out/1kgp_5 --update-sex out/1kgp_5.metadata 1 --make-bed --out out/1kgp_6

Remove HapMap individuals from 1KGP

awk '{print $2, $2}' data/hapmap_0.fam > out/hapmap_0.ind

plink --bfile out/1kgp_6 --remove out/hapmap_0.ind --make-bed --out out/1kgp_7

Update IDs

We will use FID to indicate continental population of individual.

awk 'FS=" " {print $1, $2, $5, $2}' out/1kgp_5.metadata > out/update_IDs

plink --bfile out/1kgp_7 --update-ids out/update_IDs --make-bed --out data/1kgp

The output file of this command can be found in exercises/data, and we will use 1kgp plink files as a starting point for our tutorials.