Mini-Tutorals: Difference between revisions

From QiuLab
Jump to navigation Jump to search
imported>Lab
imported>Lab
Line 27: Line 27:
# assigns taxonomic labels to short DNA reads by examining the k-mers within a read and querying a database with those k-mers: <code>kraken -db /belfer-ebox/projects/old_backup/qiulab/minikraken_20141208 --fastq-input 02015P1_S18_L001_R1_001.fastq.gz 02015P1_S18_L001_R2_001.fastq.gz --gzip-compressed --output outfile --paired </code>
# assigns taxonomic labels to short DNA reads by examining the k-mers within a read and querying a database with those k-mers: <code>kraken -db /belfer-ebox/projects/old_backup/qiulab/minikraken_20141208 --fastq-input 02015P1_S18_L001_R1_001.fastq.gz 02015P1_S18_L001_R2_001.fastq.gz --gzip-compressed --output outfile --paired </code>
# summarize taxonomy: <code> kraken-translate --db /belfer-ebox/projects/old_backup/qiulab/minikraken_20141208 outfile > predict.file </code>
# summarize taxonomy: <code> kraken-translate --db /belfer-ebox/projects/old_backup/qiulab/minikraken_20141208 outfile > predict.file </code>
# count unique strains: <code>pat-8]$ cut -f2 s75.predict |  cut -f10 -d';' | grep "^[a-zA-Z]" | sort | uniq -c</code>.
# count unique strains: <code>pat-8]$ cut -f2 s75.predict |  cut -f10 -d';' | grep "^[a-zA-Z]" | sort | uniq -c |  sort -rn</code>.
# Pick the strain with the highest counts as the reference genome
# Pick the strain with the highest counts as the reference genome



Revision as of 21:36, 7 September 2017

D3js Tutorial

  1. Install brackets editor from adobe: http://brackets.io/
  2. Enable web server & create a directory "public_html" in your home directory
  3. Create the following file structure: test/index.html; test/js/index.js; test/css/index.css
  4. Basic D3: following this link
  5. Metabolomics: following this link

An R function to transform a wide data frame to a long one

df2long = function(x, varname, cname, rname){

 num.col = dim(x)[2]
 x$tmp = rownames(x)
 y = reshape(x, varying=1:num.col, v.names=varname, timevar=cname, times=names(x)[1:num.col], direction="long")
 y=y[,-4]
 rownames(y)=NULL
 colnames(y)[1] = rname
 y

}

Variant verification using IGV

  1. Prepare & load reference genomes
    1. Load FASTA: bioseq -i'genbank' ref.gb > ref.fas
    2. Prepare GFF3 file: bp_genbank2gff3.pl --CDS --filter exon --filter gene /home/chongdi/michelle/pa_rp73.gb
  2. Prepare BAM files (see below on how to index and align reads using bwa and samtools, in "ospC Amplicon")
  3. Index sorted bam files with samtools & load BAM files (multiple bam files okay)

KRAKEN

  1. assigns taxonomic labels to short DNA reads by examining the k-mers within a read and querying a database with those k-mers: kraken -db /belfer-ebox/projects/old_backup/qiulab/minikraken_20141208 --fastq-input 02015P1_S18_L001_R1_001.fastq.gz 02015P1_S18_L001_R2_001.fastq.gz --gzip-compressed --output outfile --paired
  2. summarize taxonomy: kraken-translate --db /belfer-ebox/projects/old_backup/qiulab/minikraken_20141208 outfile > predict.file
  3. count unique strains: pat-8]$ cut -f2 s75.predict | cut -f10 -d';' | grep "^[a-zA-Z]" | sort | uniq -c | sort -rn.
  4. Pick the strain with the highest counts as the reference genome

PATRIC microbial genome annotation CLI

  1. fetching all genomes from a genus

p3-all-genomes --in genome_name,Borrelia > Borrelia.genomes p3-get-genome-data -i Borrelia.genomes -e genome_status,complete > Borrelia-complete.genomes

  1. retrieves features from coding sequences only: input file must be a list of genome ids with a header!

p3-get-genome-features -i $file -e feature.feature_type,CDS -a pgfam_id -a patric_id -a plfam_id -a start -a end -a strand -a genome_name -a product -a accession > ${name}-all-features.txt

Parsimony reconstruction using MPR

# MPR for each column
plot.mpr <- function(column=1) {
  plot.phylo(tr.unrooted, main = paste(colnames(ph.t)[column]))
  tmpr<-MPR(ph.t[,column], tr.unrooted, outgroup = "gid_1311.1320")
  nodelabels(paste("[", tmpr[, 1], ",", tmpr[, 2], "]", sep = "")) 
  tiplabels(ph.t[,column][tr.unrooted$tip.label], adj = -2)
}

# add internal node states
datalist <- data.frame(fam=character(), a=numeric(), b=numeric(), c=numeric())
for (i in 1:ncol(ph.t)) {
#for (i in 1:10) {
  tmpr<-MPR(ph.t[,i], tr.unrooted, outgroup = "gid_1311.1320");
  out<-c(colnames(ph.t)[i],tmpr[,1])
  datalist <- rbind(datalist, data.frame(fam=colnames(ph.t)[i], a=tmpr[1,1], b=tmpr[2,1], c=tmpr[3,1])) 
}
#plotting gains and losses
tr.unrooted <- unroot(tr)
gains<-apply(ph.node.states[,9:15], 2, function(x) length(which(x>0)))
losses<-apply(ph.node.states[,9:15], 2, function(x) length(which(x<0)))
plot(tr.unrooted)
edgelabels(gains, adj = c(0.5,-0.25), col=2, frame= "none")
edgelabels(losses, adj = c(0.5,1.25), col=4, frame= "none")

ospC amplicon identification

# de novo amplicon assembly with cortex_con (quality threshold 10, minimum coverage 100)
~/cortex_con/bin/cortex_con_31 --input cortex-input-file-list --input_format fastq --kmer_size 31 --mem_height 17 mem_width 100 -d cortex-out-2  -q 10 -s 100

# Then run mapsembler to extend on both ends (-t2: contig, -p: prefix, -c: minimum coverage)
~/mapsembler2_pipeline/run_mapsembler2_pipeline.sh -s seq-24r.nuc -r "4_S4_L001_R1_001.fastq.gz 4_S4_L001_R2_001.fastq.gz" -t2 -p map-4 -c 200

#Index nucleotide file: 
bwa index ref.fa

# may also be necessary to match reads separately
bioseq --break ref.fa
for f in *.fas; do bwa.index $f; done

#align:
bwa mem ref.fa sample_read1.fq  sample_read2.fq > sample.sam 

#BAM output: 
samtools view -b sample.sam > sample.bam

#Sort & index BAM file:
samtools sort sample.bam sample.sorted 
samtools index sample.sorted

#Extract Coverage by samtools (not good: capped artificially at depth=8k)
samtools depth sample.sorted.bam > sample.depth

# Calculate coverage by bedtools (better, raw read coverages)
bedtools coverage -abam sample.sorted.bam -b refs.bed -d > sample.bedtools.cov # per-site
bedtools coverage -abam sample.sorted.bam -b refs.bed -count > sample.bedtools.cov # average

#R plot:
library(lattice)
x <- read.table("sample.depth", sep="\t", header=F)
colnames(x) <- c("ref", "pos", "cov")
xyplot(log10(cov) ~ pos|ref, data= x, type = "l", main= "sample")

Running a Screen Session

use byobu

  1. start a screen session by typing "byobu"
  2. run commands
  3. Detach by pressing "F6"
  4. Reattach by typing "byobu"
  5. Terminate by typing "exit"

use screen

  • Start a screen session
screen
  • Detach the running mission
ctrl + A + D
  • Show the list of running process
screen -ls
  • Reattach a running process
screen -r ProcessID
  • Terminate a process
ctrl + A
:quit

Bp-utils: sequence, alignment & tree utilities by Qiu Lab

bioseq: sequence/FASTA manipulations

  • Use accession "CP002316.1" to retrieve the Genbank file from NCBI. Save the output (in genbank format) to a file named as "cp002316.gb".
bioseq -f "CP002316.1" -o'genbank' > cp002316.gb
  • Use the above file as input, extract FASTA sequences for each genes and save the output to a new file called "cp002316.nuc". Use this file for the following questions.
bioseq -i "genbank" -F cp002316.gb > cp002316.fas
  • Count the number of sequences.
bioseq -n cp002316.fas
  • In a single command, pick the first 10 sequences and find their length
bioseq -p "order:1-10" cp002316.fas | bioseq –l
  • In a single command, pick the third and seventh sequences from the file and do the 3-frame translation. Which reading frame is the correct on both? Specify
bioseq -p "order:3,7" cp002316.fas | bioseq -t3
  • Find the base composition of the last two sequences
bioseq -p "order:25-26" cp002316.fas| bioseq –c
  • Pick the sequence with id "Bbu|D1_B11|8784|9302|1" and count the number of codons present in this sequence
bioseq -p "id:BbuJD1_B11|8784|9302|1" cp002316.fas | bioseq –C
  • Delete the last 10 sequences from the file and save the output to cp002316-v2.nuc
bioseq -d "order:17-26" cp002316.fas > cp002316-v2.nuc
  • In a single command, pick the first sequence, then get the 50-110 nucleotides and make reverse complement of the sub-sequences
bioseq -p "order:1" cp002316.fas | bioseq -s "50,110" | bioseq –r
  • In a single command, get the first 100 nucleotides of all the sequences present in the file and do 1-frame translation of all sub-sequences.
bioseq -s "1,100" cp002316.fas | bioseq -t1

bioaln: alignment/CLUSTALW manipulations

  • Go to /home/shared/LabMeetingReadings/Test-Data and find the sequence alignment file “bioaln_tutorial.aln”. Name the format of the alignment file. Use it to answer all the questions below.
CLUSTALW
  • Find the length of the alignment.
bioaln -l bioaln_tutorial.aln
  • Count the number of the sequences present in the alignment.
bioaln -n bioaln_tutorial.aln
  • How do you convert this alignment in phylip format? Save the output.
bioaln -o "phylip" bioaln_tutorial.aln > test.phy
  • Pick “seq2, seq5, seq7, seq10” from the alignment and calculate their average percent identity.
bioaln -p "seq2, seq5, seq7, seq10" bioaln_tutorial.aln | bioaln -a
  • Get an alignment slice from “50-140” and find the average identities of the slice for sliding windows of 25.
bioaln -s "50, 140" bioaln_tutorial.aln | bioaln -w "25"
  • Extract conserved blocks from the alignment.
bioaln -B bioaln_tutorial.aln
  • Find the unique sequences and list their ids.
bioaln -u bioaln_tutorial.aln | bioaln -L
  • Extract third sites from the alignment and show only variable sites in match view.
bioaln -T bioaln_tutorial.aln | bioaln -v | bioaln -m
  • Remove the gaps and show the final alignment in codon view for an alignment slice “1-100”.
 bioaln -s "1, 100" bioaln_tutorial.aln | bioaln -g | bioaln -c
  • Add a 90% consensus sequence and then show the final alignment in match plus codon view for an alignment slice “20-80”. (Hint: match view followed by codon view)
bioaln -s "20, 80" bioaln_tutorial.aln | bioaln -C "90" | bioaln -m | bioaln -c

biotree: tree/NEWICK manipulations

biopop: SNP statistics

Homology searching and clustering

BLAST+: search("google") for homologs/pariwise alignment

hmmer

cdhit

cdhit -i all.pep -o all.cdhit -c 0.5 -n 3

Options:

  • -i: input file
  • -o: output file
  • -c: percent identity (below which it is considered different families)
  • -n: word length

interproscan

../../software/interproscan/interproscan-5.13-52.0/interproscan.sh -i trep-cdhit.representatives.pep2 -o  trep-representatives.tsv -t p -goterms -pa -f tsv

Documentation page: How to run

Programs for producing multiple alignments

MUSCLE

MUGSY

  • MUGSY bash
#!/bin/sh

export MUGSY_INSTALL=/home/weigang/mugsy_x86-64-v1r2.2
export PATH=$PATH:$MUGSY_INSTALL:$MUGSY_INSTALL/mapping
export PERL5LIB=$MUGSY_INSTALL/perllibs
#For testing TBA
#export PATH=$PATH:$MUGSY_INSTALL/../../multiz-tba/trunk/
  • source the bash file
source mugsyenv.sh
  • run mugsy
mugsy --directory /home/chongdi/Streptococcus/mugsy-output -prefix mugsy_aln mugsy-input/*.fa

CLUSTALW

MAFT

TCOFFEE

Programs for producing phylogeny & phylogenetic analysis

FastTree

PHYLIP

MrBayes

RaXML

  • Required arguments
    • -s alignment (in PHYLIP or FASTA)
    • -n tag
  • Simple run (ML): raxmlHPC-SSE3 -x 12345 -p 12345 -# autoMRE -s concat.fas -m GTRGAMMA -n tag -q part.txt
  • Bootstrap: raxmlHPC-SSE3 -f a -x 12345 -p 12345 -# 100 -s concat.fas -m GTRCAT -n tag -q part.txt
  • Make a file named "part.txt" with the following lines (Chanlge the number to the total length of your alignment):
DNA, gene1codon1 = 1-3765906\3
DNA, gene1codon2 = 2-3765906\3
DNA, gene1codon3 = 3-3765906\3
  • The resulting files
    • RAxML_bestTree.tag: best tree (no bootstrap)
    • RAxML_bipartitionsBranchLabels.tag: ignore
    • RAxML_bipartitions.tag: main result. Feed this tree to figtree
    • RAxML_bootstrap.tag: ignore
    • RAxML_info.tag : log file
  • Protein models
    • raxmlHPC-SSE3 -s protein.phy -n A1 -m PROTGAMMAWAS # protein, gamma, Whelan & Goldman (2001) model
    • raxmlHPC-SSE3 -s protein.phy -n A2 -m PROTGAMMAGTR # protein, gamma, user model
    • raxmlHPC-SSE3 -s protein.phy -n A1 -m PROTGAMMAWAS -# 100 -b 0123 # protein, gamma, Whelan & Goldman (2001) model, bootstrap
    • raxmlHPC-SSE3 -s protein.phy -n A1 -m PROTGAMMAWAS -o Carp,Loach # protein, gamma, Whelan & Goldman (2001) model, root on a (monophyletic group)
    • raxmlHPC-SSE3 -s protein.phy -n A1 -m PROTGAMMAWAS -# autoFC -b 0123 # protein, gamma, Whelan & Goldman (2001) model, bootstrap (at least 99 splits, auto-stopping)
    • raxmlHPC-SSE3 -f a -s protein.phy -n A1 -m PROTGAMMAWAG -# 100 -p 0123 -x 0123 # Rapid bootstrap with consensus
      • output 1, RAXML_bestTree.A1 (ml tree)
      • output 2, RAXMLbootstrap.A1 (bootstrap relicates)
      • output 3, RAXMLbipartitions.A1 (tree with boot strap values)

PhyloNet

R packages for phylogenetics

APE

phengorn

phytools

Population genetics

ms: coalescence simulation

SFS: forward simulation

PAML: testing selection with Ka/Ks

Microbial genome databases & pipelines in Qiu Lab

borreliabase

pa2

spiro_genomes/treponema

Blast a set of genes against a bacteria genome

  1. download genome
  2. extract gene sequences & translate
  3. Make blast database
  4. Run blastp

de novo variant call with cortex_var

Create binary file of fasta genome file.

Run contex_var_31_c1 (cutoff 1 used for 1 genome)

  • --se_list is the command the reads the list you want to target (ie: list-genome.txt)
  • --kmer_size is the middle size, has to be an odd integer
  • --mem_width always choose 17
  • --mem_height always choose 100
  • --dump_binary Name your file name (ie: Genome.ctx)
/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 
--se_list list-Evo.txt 
--kmer_size 31 
--mem_width 17 
--mem_height 100 
--dump_binary Evo.ctx 
> Evo.log save log file

Read each binary file (.ctx) into its own individual color list (ls Evo.ctx > Evo.colorlist) Then save these lists into their own collective colorlist.txt (ls *.ctx > colorlist.txt)

Reveal genetic variation using the Bubble Caller from cortex_var.

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 
--se_list colorlist.txt 
--kmer_size 31 
--mem_width 17 
--mem_height 100 
--dump_binary all-colors.ctx 
> all-colors.log save log file

Bubble caller will detect differences between each genome by assigning distinct colors to each genome (note that the UK spelling of color is used: colour)

  • --multicolour_bin holds your all-colors.ctx binary from the Bubble Caller
  • --detect_bubbles1 i/i Detects 1 variation between genomes i and i. i indicates the position number the genome is listed on the colorlist.txt file. If the genome is fourth on the colorlist.txt, for example, its corresponding i variable is 4
  • --output_bubbles1 Output variant reads in fasta format (ie: Evo-RefHG.var for bubble detection between

Evolved genome and Reference HG genome)

  • --print_colour_coverages necessary for output
/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 
--kmer_size 31  
--mem_height 17 
--mem_width 100 
--multicolour_bin all-colors.ctx 
--detect_bubbles1 0/1 
--output_bubbles1 Evo-RefHG.var 
--print_colour_coverages  
> Evo-RefHG.log save log file

Variant call with cortex_var

Example

  • Files
MS00018_S5_L001_R1_001.fastq
MS00018_S5_L001_R2_001.fastq
KU-090401_S3_L001_R1_001.fastq
KU-090401_S3_L001_R2_001.fastq
...
  • File content
@M03268:52:000000000-AJFAY:1:1101:16970:1555 1:N:0:7
CCCATGAACGGCACGTTCACGATGCAGAAGGTGGTGACCAGGCCGGTGTCGGCGACCTCGACGTAATCGTCGGTGGGGGA
GCCGTCGCGCGGGTCCGCGCCGCGCGGCGGGAAGTACACCTTGCCGTCCATGCGGGCGCGGGCGCCCAGCAGACGACCCT
CCATGAGCGCATTCAGGAATTCGGCCTCCTGCGGCGCGGCGGTGTGCTTGATATTGAAGTCGACCGGGGTCACGATGCCG
GTGACCGGTT
+
11>1>111@11>1A1FGF1FC0F0A111010GB0ECBFGF0/AFECGGFHECE??EEGHE/EEEHEFHEHHGCEECE??/
<CFCGGCCGCCCCCCHCGGCCCG@G??@@?@-@BFFFFFFFFFFF;B@FBFBFF<?@;@@-=?@??@-EFFFFF@;@@@F
FFBF/BBF;@@FFFFFFFFFFFFF@FFFFFFF<@@@@@@@@@@@FBFFFFFFFFFFFFFFFFF@@@?@?@FFFFFFFBF@
?@@EFF@@<B
@M03268:52:000000000-AJFAY:1:1101:16136:1618 1:N:0:7
TCCTGGCCCGTGAAACCGCTTGCCCGGTACAGGTTCTGGACTACCGCCTGGCACCCGAGCATCCGTTCCCGGCGGCGCTC
GACGACGCCGAGGCGGCGTTCCTGGAACTGGTGGCCGCCGGCTACCGGCCCGAACGCATTGCGGTCGCGGGTGATTCGGCCGGTGGCGGGCTCTCGCTCGCGCTGGCCGAACGGCTGCGCGACCGGCACGGGCTGGTTCCGGCCGCGCTCGGGCTGATCGCGCCCTGGGC
+
11>>11C1A@A?11BDF?EEGAFGFG?ECCHBHHGFG1EGHFHHHCCGGC/GCGGGCEECEFGHGGFFGHGGGGCECCC<CCC@CCCC@CGCCCGGC?:@EBFBF/CFFFF0/CFG?=@=-9>AFF@=@@?@;@@@F--9:BF@A@@?@--;9-BFFFF@A-@99B?-9@?=EFFBAAE;9>?;@@BF@9-@-;@=-E@@@=@@;@?>@9@<?@?BBBFFF;?>;;-@@?@<?@?9-FBFF@-99-9E-9
...

Step 1

  • Create matched FASTQ files (python script)
#!/usr/bin/python                                                               
  
from sys import argv
script, File1, File2 = argv

# Create a dictionary listing the sequences in the first file for reference
file1 = open(File1)
dict1 = {}
for line in file1:
    if '@M03268' in line:
        tag1 = line.rstrip()[:-9]
        tail = line.rstrip()[-9:]
        dict1[tag1] = []
    else:
        dict1[tag1].append(line.rstrip())
file1.close()

# Create two output files
f1 = open(File1.replace('.fastq', '_mat.fq'), 'w')
f2 = open(File2.replace('.fastq', '_mat.fq'), 'w')

# Match the sequence
file2 = open(File2)
for line in file2:
    if dict1.has_key(line.rstrip()[:-9]): # The has_key method
        tag1 = line.rstrip()[:-9]
        f1.write(tag1 + tail + '\n')
        for j in range(3):
            f1.write(dict1[tag1][j] + '\n')
        del dict1[tag1]
        dict2 = {} # Create a temporary dictionary for sequence in the file2
        tag2 = line.rstrip()
        dict2[tag2] = []
    else:
        dict2[tag2].append(line.rstrip())
    if len(dict2[tag2]) == 3:
        f2.write(tag2 + '\n')
        for j in range(3):
            f2.write(dict2[tag2][j] + '\n')
file2.close()

f1.close()
f2.close()

Step 2. Clean Fastq Files & Run Single-color Graph & Error Cleaning

  • Create a file list showing all outcome files whose extensions need to be changed from _mat.fq to .list
bbduk.sh -Xmx1g in1=fastq_file1 in2=fastq_file2 -out1=clean1.fq -out2=clean2.fq qtrim=rl trimq=20
  • Create a file list showing all outcome files whose extensions need to be changed from _mat.fq to .list
for f in *_mat.fq; 
do 
  title=$(echo $f | cut -d'_' -f2); 
  id=$(echo $f | cut -d'_' -f1); 
  echo $f > ${id}.list${title}; 
done
  • Single color graph for sample
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 
--pe_list MS00018_S5.list1,MS00018_S5.list2 
--kmer_size 31 
--mem_height 17 
--mem_width 100 
--dump_binary MS00018_S5.ctx 
--sample_id MS00018_S5 
--remove_pcr_duplicates 
--quality_score_threshold 20 > MS00018_S5.log
  • Single color graph for reference
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 
--se_list ref.filelist 
--kmer_size 31 
--mem_height 17 
--mem_width 100 
--dump_binary ref.ctx 
--sample_id ref 20 > ref.log
  • Run Error Cleaning for All Samples (reference is not included)
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 
--mem_height 18
--mem_width 100
--kmer_size 31
--multicolour_bin N18_S15.ctx
--remove_low_coverage_supernodes 10
--dump_binary N18_S15.cleaned.ctx

Step 3

  • Pull the name of each .cleaned.ctx file to a cleaned.list file, then create a .filelist file for all cleaned.list files.
ls file1.cleaned.ctx > file1.cleaned.list
ls file2.cleaned.ctx > file2.cleaned.list
ls ref.ctx > ref.list
ls -1 *.list > ref-sample.filelist
  • Multicolour Graph
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c3
--mem_height 20 
--mem_width 100 
--kmer_size 31 
--colour_list ref-sample.filelist 
--dump_binary ref-sample.ctx > ref-sample.log

Step 4

  • Variation Discovery Using The Bubble Caller
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c3 
--mem_height 20 
--mem_width 100 
--kmer_size 31 
--multicolour_bin ref-sample.ctx 
--detect_bubbles1 -1/-1 
--ref_colour 2 
--output_bubbles1 bubbles-in-sample.out 
--print_colour_coverages 
--experiment_type EachColourAHaploidSampleExceptTheRefColour 
--genome_size 8000000 > bubbles-in-sample.log

Step 5

  • Reference Genome

(When in Cluster execute "module load stampy": doesn't work; path problem) (Run the following on wallace:)

stampy.py -G ref ref.fa
stampy.py -g ref -H ref
  • Turn Into VCF with reference

Make a sample list file (from bubble or multicolor log file):

cat e1-bubbles-in-sample.log | grep CLEANED | cut -f2 > e1.sample.lis

Customize the following command based on your output files, num of colors, index of ref colors, etc

perl /home/weigang/CORTEX_release_v1.0.5.21/scripts/analyse_variants/process_calls-wq.pl --callfile e1-bubbles-in-sample.out --callfile_log e1-bubbles-in-sample.log -outvcf e1-bubbles-in-sample --outdir e1-vcfout --samplename_list e1.sample.list --num_cols 7 --stampy_bin /home/weigang/stampy-1.0.28/stampy.py --stampy_hash ref --refcol 6 --vcftools_dir /usr/local/bin --caller BC --kmer 31 --ploidy 1

Step 6. Parse VCF files

  1. Filter out low-quality sites:

vcftools --vcf pat-5.decomp.vcf --keep-filtered PASS --recode --out pat-5

  1. Extract coverage

vcftools --vcf pat-5.recode.vcf --extract-FORMAT-info COV (output file: out.COV.FORMAT)

  1. Extract genotypes

vcftools --vcf pat-5.recode.vcf --extract-FORMAT-info GT (output file: out.GT.FORMAT)

  1. Extract confidence

vcftools --vcf pat-5.recode.vcf --extract-FORMAT-info GT_CONF (output file: out.GT_CONF.FORMAT)

  1. Send GT_CONF file to R and visualize log10(conf) distribution with boxplot
  2. Use custom PERL file to filter out low-quality (e.g., GT_CONF < 30) genotype calls (flag with "?" or "na"), and make haplotype GT ("1/1" to "1", "0/0" to "0", "./." to "?)
  3. Verify variants using IGV (see IGV protocol above)

Step 7. Variant Annotation & Visualization

  1. vcf_parser.py out.decomp.vcf ref.gb (the code needs validation)
  2. Data to database & web visualization, if necessary

hmmer

Annotate proteins with TIGRFAM

hmmsearch --tblout foo.hmmout   # table output for all sequences
          --domtblout foo.dmout # table output for all domains
          -E 0.01               # level of sequence significance
          --domE 0.01           # level of domain significance
          -o /dev/null          # don't show STDOUT 
          ../../TIGRFAMs-Release-15-Sep-17-2014/TIGRFAMs_15.0_HMM.LIB # HMM profile library for tiger fams 
          GCA_000583715.2_ASM58371v2_protein.faa &                    # input/query file in FASTA

PopGenome

library(PopGenome)
g = readVCF("pvt1.recode.vcf.gz", 1000, "8", 127890000, 128102000)
pops = split(sample[,1], sample[,2]) # create a list of populations
g = set.populations(g, pops, diploid = T) # set population names

# by windows
slide = sliding.window.transform(g, width = 100, jump = 20) # nsnps, not actual length
slide = F_ST.stats(slide, mode = "nucleotide")
snp.pos = slide@region.data@biallelic.sites # SNP positions
win.num = length(slide@region.names)
win.start = numeric()
for (i in 1:win.num) {win.start[i] = snp.pos[[i]][1]}
fst = slide@nuc.F_ST.vs.all
pop.names = names(slide@populations) # population names
plot(win.start, fst[,1], type ="n", las = 1, ylab = expression(F[st]), xlab = "SNP Position", ylim = c(0,0.4))
for (i in 1:length(slide@populations)) {
  lines(win.start, fst[,i], type = "l", col = pop.group[pop.names[i],4])
}
arch.coords=c(127982050, 127992931)
abline(v = arch.coords, col = "orange")
#rect(xleft = arch.coords[1], ybottom = -1, xright = arch.coords[2], ytop = 0.5, border = "transparent", col = 2)

Velvet

Cleaning

../../bbmap/bbduk.sh -Xmx1g 
           in1=WGC067462_hunhewenku_509_combined_R1.fastq # gz file works as well
           in2=WGC067462_hunhewenku_509_combined_R2.fastq # gz file works as well
           -out1=clean1.fq 
           -out2=clean2.fq 
           qtrim=rl
           trimq=20

Running Velvet Optimizer

srun ../../VelvetOptimiser-2.2.5/VelvetOptimiser.pl 
          --t 32
          --s 31 --e 31 --x 6 # kmer sizes
          -f '-shortPaired -fastq clean1.fq -shortPaired2 -fastq clean2.fq'
          -t 4 
          --optFuncKmer ‘n50’ 
          -p prefix

PacBio assembly with canu

./canu -p staph-auto-5 -d staph-auto-5 genomeSize=2.2m -pacbio-raw pac-reads-5.tar.gz