Revision as of 19:11, 25 July 2017

An R function to transform a wide data frame to a long one=

df2long = function(x, varname, cname, rname){

 num.col = dim(x)[2]
 x$tmp = rownames(x)
 y = reshape(x, varying=1:num.col, v.names=varname, timevar=cname, times=names(x)[1:num.col], direction="long")
 y=y[,-4]
 rownames(y)=NULL
 colnames(y)[1] = rname
 y

}

Variant verification using IGV

Prepare & load reference genomes
1. Load FASTA
2. Prepare GFF3 file: bp_genbank2gff3.pl --CDS --filter exon --filter gene /home/chongdi/michelle/pa_rp73.gb
Prepare BAM files (see below on how to index and align reads using bwa and samtools, in "ospC Amplicon")
Load BAM files (multiple bam files okay)

KRAKEN

assigns taxonomic labels to short DNA reads by examining the k-mers within a read and querying a database with those k-mers: kraken -db /belfer-ebox/projects/old_backup/qiulab/minikraken_20141208 --fastq-input 02015P1_S18_L001_R1_001.fastq.gz 02015P1_S18_L001_R2_001.fastq.gz --gzip-compressed --output outfile --paired
summarize taxonomy: kraken-translate --db /belfer-ebox/projects/old_backup/qiulab/minikraken_20141208 outfile > predict.file

PATRIC microbial genome annotation CLI

fetching all genomes from a genus

p3-all-genomes --in genome_name,Borrelia > Borrelia.genomes p3-get-genome-data -i Borrelia.genomes -e genome_status,complete > Borrelia-complete.genomes

retrieves features from coding sequences only: input file must be a list of genome ids with a header!

p3-get-genome-features -i $file -e feature.feature_type,CDS -a pgfam_id -a patric_id -a plfam_id -a start -a end -a strand -a genome_name -a product -a accession > ${name}-all-features.txt

Parsimony reconstruction using MPR

# MPR for each column
plot.mpr <- function(column=1) {
  plot.phylo(tr.unrooted, main = paste(colnames(ph.t)[column]))
  tmpr<-MPR(ph.t[,column], tr.unrooted, outgroup = "gid_1311.1320")
  nodelabels(paste("[", tmpr[, 1], ",", tmpr[, 2], "]", sep = "")) 
  tiplabels(ph.t[,column][tr.unrooted$tip.label], adj = -2)
}

# add internal node states
datalist <- data.frame(fam=character(), a=numeric(), b=numeric(), c=numeric())
for (i in 1:ncol(ph.t)) {
#for (i in 1:10) {
  tmpr<-MPR(ph.t[,i], tr.unrooted, outgroup = "gid_1311.1320");
  out<-c(colnames(ph.t)[i],tmpr[,1])
  datalist <- rbind(datalist, data.frame(fam=colnames(ph.t)[i], a=tmpr[1,1], b=tmpr[2,1], c=tmpr[3,1])) 
}
#plotting gains and losses
tr.unrooted <- unroot(tr)
gains<-apply(ph.node.states[,9:15], 2, function(x) length(which(x>0)))
losses<-apply(ph.node.states[,9:15], 2, function(x) length(which(x<0)))
plot(tr.unrooted)
edgelabels(gains, adj = c(0.5,-0.25), col=2, frame= "none")
edgelabels(losses, adj = c(0.5,1.25), col=4, frame= "none")

ospC amplicon identification

# de novo amplicon assembly with cortex_con (quality threshold 10, minimum coverage 100)
~/cortex_con/bin/cortex_con_31 --input cortex-input-file-list --input_format fastq --kmer_size 31 --mem_height 17 mem_width 100 -d cortex-out-2  -q 10 -s 100

# Then run mapsembler to extend on both ends (-t2: contig, -p: prefix, -c: minimum coverage)
~/mapsembler2_pipeline/run_mapsembler2_pipeline.sh -s seq-24r.nuc -r "4_S4_L001_R1_001.fastq.gz 4_S4_L001_R2_001.fastq.gz" -t2 -p map-4 -c 200

#Index nucleotide file: 
bwa index ref.fa

# may also be necessary to match reads separately
bioseq --break ref.fa
for f in *.fas; do bwa.index $f; done

#align:
bwa mem ref.fa sample_read1.fq  sample_read2.fq > sample.sam 

#BAM output: 
samtools view -b sample.sam > sample.bam

#Sort BAM file:
samtools sort sample.bam sample.sorted 

#Extract Coverage by samtools (not good: capped artificially at depth=8k)
samtools depth sample.sorted.bam > sample.depth

# Calculate coverage by bedtools (better, raw read coverages)
bedtools coverage -abam sample.sorted.bam -b refs.bed -d > sample.bedtools.cov # per-site
bedtools coverage -abam sample.sorted.bam -b refs.bed -count > sample.bedtools.cov # average

#R plot:
library(lattice)
x <- read.table("sample.depth", sep="\t", header=F)
colnames(x) <- c("ref", "pos", "cov")
xyplot(log10(cov) ~ pos|ref, data= x, type = "l", main= "sample")

Running a Screen Session

use byobu

start a screen session by typing "byobu"
run commands
Detach by pressing "F6"
Reattach by typing "byobu"
Terminate by typing "exit"

use screen

Start a screen session

screen

Detach the running mission

ctrl + A + D

Show the list of running process

screen -ls

Reattach a running process

screen -r ProcessID

Terminate a process

ctrl + A
:quit

Bp-utils: sequence, alignment & tree utilities by Qiu Lab

bioseq: sequence/FASTA manipulations

Use accession "CP002316.1" to retrieve the Genbank file from NCBI. Save the output (in genbank format) to a file named as "cp002316.gb".

bioseq -f "CP002316.1" -o'genbank' > cp002316.gb

Use the above file as input, extract FASTA sequences for each genes and save the output to a new file called "cp002316.nuc". Use this file for the following questions.

bioseq -i "genbank" -F cp002316.gb > cp002316.fas

Count the number of sequences.

bioseq -n cp002316.fas

In a single command, pick the first 10 sequences and find their length

bioseq -p "order:1-10" cp002316.fas | bioseq –l

In a single command, pick the third and seventh sequences from the file and do the 3-frame translation. Which reading frame is the correct on both? Specify

bioseq -p "order:3,7" cp002316.fas | bioseq -t3

Find the base composition of the last two sequences

bioseq -p "order:25-26" cp002316.fas| bioseq –c

Pick the sequence with id "Bbu|D1_B11|8784|9302|1" and count the number of codons present in this sequence

bioseq -p "id:BbuJD1_B11|8784|9302|1" cp002316.fas | bioseq –C

Delete the last 10 sequences from the file and save the output to cp002316-v2.nuc

bioseq -d "order:17-26" cp002316.fas > cp002316-v2.nuc

In a single command, pick the first sequence, then get the 50-110 nucleotides and make reverse complement of the sub-sequences

bioseq -p "order:1" cp002316.fas | bioseq -s "50,110" | bioseq –r

In a single command, get the first 100 nucleotides of all the sequences present in the file and do 1-frame translation of all sub-sequences.

bioseq -s "1,100" cp002316.fas | bioseq -t1

bioaln: alignment/CLUSTALW manipulations

Go to /home/shared/LabMeetingReadings/Test-Data and find the sequence alignment file “bioaln_tutorial.aln”. Name the format of the alignment file. Use it to answer all the questions below.

CLUSTALW

Find the length of the alignment.

bioaln -l bioaln_tutorial.aln

Count the number of the sequences present in the alignment.

bioaln -n bioaln_tutorial.aln

How do you convert this alignment in phylip format? Save the output.

bioaln -o "phylip" bioaln_tutorial.aln > test.phy

Pick “seq2, seq5, seq7, seq10” from the alignment and calculate their average percent identity.

bioaln -p "seq2, seq5, seq7, seq10" bioaln_tutorial.aln | bioaln -a

Get an alignment slice from “50-140” and find the average identities of the slice for sliding windows of 25.

bioaln -s "50, 140" bioaln_tutorial.aln | bioaln -w "25"

Extract conserved blocks from the alignment.

bioaln -B bioaln_tutorial.aln

Find the unique sequences and list their ids.

bioaln -u bioaln_tutorial.aln | bioaln -L

Extract third sites from the alignment and show only variable sites in match view.

bioaln -T bioaln_tutorial.aln | bioaln -v | bioaln -m

Remove the gaps and show the final alignment in codon view for an alignment slice “1-100”.

 bioaln -s "1, 100" bioaln_tutorial.aln | bioaln -g | bioaln -c

Add a 90% consensus sequence and then show the final alignment in match plus codon view for an alignment slice “20-80”. (Hint: match view followed by codon view)

bioaln -s "20, 80" bioaln_tutorial.aln | bioaln -C "90" | bioaln -m | bioaln -c

biotree: tree/NEWICK manipulations

biopop: SNP statistics

Homology searching and clustering

BLAST+: search("google") for homologs/pariwise alignment

hmmer

cdhit

cdhit -i all.pep -o all.cdhit -c 0.5 -n 3

Options:

-i: input file
-o: output file
-c: percent identity (below which it is considered different families)
-n: word length

interproscan

../../software/interproscan/interproscan-5.13-52.0/interproscan.sh -i trep-cdhit.representatives.pep2 -o  trep-representatives.tsv -t p -goterms -pa -f tsv

Documentation page: How to run

Programs for producing multiple alignments

MUSCLE

MUGSY

MUGSY bash

#!/bin/sh

export MUGSY_INSTALL=/home/weigang/mugsy_x86-64-v1r2.2
export PATH=$PATH:$MUGSY_INSTALL:$MUGSY_INSTALL/mapping
export PERL5LIB=$MUGSY_INSTALL/perllibs
#For testing TBA
#export PATH=$PATH:$MUGSY_INSTALL/../../multiz-tba/trunk/

source the bash file

source mugsyenv.sh

run mugsy

mugsy --directory /home/chongdi/Streptococcus/mugsy-output -prefix mugsy_aln mugsy-input/*.fa

CLUSTALW

MAFT

TCOFFEE

Programs for producing phylogeny & phylogenetic analysis

FastTree

PHYLIP

MrBayes

RaXML

Required arguments
- -s alignment (in PHYLIP or FASTA)
- -n tag
Simple run (ML): raxmlHPC-SSE3 -x 12345 -p 12345 -# autoMRE -s concat.fas -m GTRGAMMA -n tag -q part.txt
Bootstrap: raxmlHPC-SSE3 -f a -x 12345 -p 12345 -# 100 -s concat.fas -m GTRCAT -n tag -q part.txt
Make a file named "part.txt" with the following lines (Chanlge the number to the total length of your alignment):

DNA, gene1codon1 = 1-3765906\3
DNA, gene1codon2 = 2-3765906\3
DNA, gene1codon3 = 3-3765906\3

The resulting files
- RAxML_bestTree.tag: best tree (no bootstrap)
- RAxML_bipartitionsBranchLabels.tag: ignore
- RAxML_bipartitions.tag: main result. Feed this tree to figtree
- RAxML_bootstrap.tag: ignore
- RAxML_info.tag : log file
Protein models
- raxmlHPC-SSE3 -s protein.phy -n A1 -m PROTGAMMAWAS # protein, gamma, Whelan & Goldman (2001) model
- raxmlHPC-SSE3 -s protein.phy -n A2 -m PROTGAMMAGTR # protein, gamma, user model
- raxmlHPC-SSE3 -s protein.phy -n A1 -m PROTGAMMAWAS -# 100 -b 0123 # protein, gamma, Whelan & Goldman (2001) model, bootstrap
- raxmlHPC-SSE3 -s protein.phy -n A1 -m PROTGAMMAWAS -o Carp,Loach # protein, gamma, Whelan & Goldman (2001) model, root on a (monophyletic group)
- raxmlHPC-SSE3 -s protein.phy -n A1 -m PROTGAMMAWAS -# autoFC -b 0123 # protein, gamma, Whelan & Goldman (2001) model, bootstrap (at least 99 splits, auto-stopping)
- raxmlHPC-SSE3 -f a -s protein.phy -n A1 -m PROTGAMMAWAG -# 100 -p 0123 -x 0123 # Rapid bootstrap with consensus
  - output 1, RAXML_bestTree.A1 (ml tree)
  - output 2, RAXMLbootstrap.A1 (bootstrap relicates)
  - output 3, RAXMLbipartitions.A1 (tree with boot strap values)

PhyloNet

R packages for phylogenetics

APE

phengorn

phytools

Population genetics

ms: coalescence simulation

SFS: forward simulation

PAML: testing selection with Ka/Ks

Microbial genome databases & pipelines in Qiu Lab

borreliabase

pa2

spiro_genomes/treponema

Blast a set of genes against a bacteria genome

download genome
extract gene sequences & translate
Make blast database
Run blastp

de novo variant call with cortex_var

Create binary file of fasta genome file.

Run contex_var_31_c1 (cutoff 1 used for 1 genome)

--se_list is the command the reads the list you want to target (ie: list-genome.txt)
--kmer_size is the middle size, has to be an odd integer
--mem_width always choose 17
--mem_height always choose 100
--dump_binary Name your file name (ie: Genome.ctx)

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 
--se_list list-Evo.txt 
--kmer_size 31 
--mem_width 17 
--mem_height 100 
--dump_binary Evo.ctx 
> Evo.log save log file

Read each binary file (.ctx) into its own individual color list (ls Evo.ctx > Evo.colorlist) Then save these lists into their own collective colorlist.txt (ls *.ctx > colorlist.txt)

Reveal genetic variation using the Bubble Caller from cortex_var.

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 
--se_list colorlist.txt 
--kmer_size 31 
--mem_width 17 
--mem_height 100 
--dump_binary all-colors.ctx 
> all-colors.log save log file

Bubble caller will detect differences between each genome by assigning distinct colors to each genome (note that the UK spelling of color is used: colour)

--multicolour_bin holds your all-colors.ctx binary from the Bubble Caller
--detect_bubbles1 i/i Detects 1 variation between genomes i and i. i indicates the position number the genome is listed on the colorlist.txt file. If the genome is fourth on the colorlist.txt, for example, its corresponding i variable is 4
--output_bubbles1 Output variant reads in fasta format (ie: Evo-RefHG.var for bubble detection between

Evolved genome and Reference HG genome)

--print_colour_coverages necessary for output

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 
--kmer_size 31  
--mem_height 17 
--mem_width 100 
--multicolour_bin all-colors.ctx 
--detect_bubbles1 0/1 
--output_bubbles1 Evo-RefHG.var 
--print_colour_coverages  
> Evo-RefHG.log save log file

Variant call with cortex_var

Example

Files

MS00018_S5_L001_R1_001.fastq
MS00018_S5_L001_R2_001.fastq
KU-090401_S3_L001_R1_001.fastq
KU-090401_S3_L001_R2_001.fastq
...

File content

@M03268:52:000000000-AJFAY:1:1101:16970:1555 1:N:0:7
CCCATGAACGGCACGTTCACGATGCAGAAGGTGGTGACCAGGCCGGTGTCGGCGACCTCGACGTAATCGTCGGTGGGGGA
GCCGTCGCGCGGGTCCGCGCCGCGCGGCGGGAAGTACACCTTGCCGTCCATGCGGGCGCGGGCGCCCAGCAGACGACCCT
CCATGAGCGCATTCAGGAATTCGGCCTCCTGCGGCGCGGCGGTGTGCTTGATATTGAAGTCGACCGGGGTCACGATGCCG
GTGACCGGTT
+
11>1>111@11>1A1FGF1FC0F0A111010GB0ECBFGF0/AFECGGFHECE??EEGHE/EEEHEFHEHHGCEECE??/
<CFCGGCCGCCCCCCHCGGCCCG@G??@@?@-@BFFFFFFFFFFF;B@FBFBFF<?@;@@-=?@??@-EFFFFF@;@@@F
FFBF/BBF;@@FFFFFFFFFFFFF@FFFFFFF<@@@@@@@@@@@FBFFFFFFFFFFFFFFFFF@@@?@?@FFFFFFFBF@
?@@EFF@@<B
@M03268:52:000000000-AJFAY:1:1101:16136:1618 1:N:0:7
TCCTGGCCCGTGAAACCGCTTGCCCGGTACAGGTTCTGGACTACCGCCTGGCACCCGAGCATCCGTTCCCGGCGGCGCTC
GACGACGCCGAGGCGGCGTTCCTGGAACTGGTGGCCGCCGGCTACCGGCCCGAACGCATTGCGGTCGCGGGTGATTCGGCCGGTGGCGGGCTCTCGCTCGCGCTGGCCGAACGGCTGCGCGACCGGCACGGGCTGGTTCCGGCCGCGCTCGGGCTGATCGCGCCCTGGGC
+
11>>11C1A@A?11BDF?EEGAFGFG?ECCHBHHGFG1EGHFHHHCCGGC/GCGGGCEECEFGHGGFFGHGGGGCECCC<CCC@CCCC@CGCCCGGC?:@EBFBF/CFFFF0/CFG?=@=-9>AFF@=@@?@;@@@F--9:BF@A@@?@--;9-BFFFF@A-@99B?-9@?=EFFBAAE;9>?;@@BF@9-@-;@=-E@@@=@@;@?>@9@<?@?BBBFFF;?>;;-@@?@<?@?9-FBFF@-99-9E-9
...

Step 1

Create matched FASTQ files (python script)

#!/usr/bin/python                                                               
  
from sys import argv
script, File1, File2 = argv

# Create a dictionary listing the sequences in the first file for reference
file1 = open(File1)
dict1 = {}
for line in file1:
    if '@M03268' in line:
        tag1 = line.rstrip()[:-9]
        tail = line.rstrip()[-9:]
        dict1[tag1] = []
    else:
        dict1[tag1].append(line.rstrip())
file1.close()

# Create two output files
f1 = open(File1.replace('.fastq', '_mat.fq'), 'w')
f2 = open(File2.replace('.fastq', '_mat.fq'), 'w')

# Match the sequence
file2 = open(File2)
for line in file2:
    if dict1.has_key(line.rstrip()[:-9]): # The has_key method
        tag1 = line.rstrip()[:-9]
        f1.write(tag1 + tail + '\n')
        for j in range(3):
            f1.write(dict1[tag1][j] + '\n')
        del dict1[tag1]
        dict2 = {} # Create a temporary dictionary for sequence in the file2
        tag2 = line.rstrip()
        dict2[tag2] = []
    else:
        dict2[tag2].append(line.rstrip())
    if len(dict2[tag2]) == 3:
        f2.write(tag2 + '\n')
        for j in range(3):
            f2.write(dict2[tag2][j] + '\n')
file2.close()

f1.close()
f2.close()

Step 2. Clean Fastq Files & Run Single-color Graph & Error Cleaning

Create a file list showing all outcome files whose extensions need to be changed from _mat.fq to .list

bbduk.sh -Xmx1g in1=fastq_file1 in2=fastq_file2 -out1=clean1.fq -out2=clean2.fq qtrim=rl trimq=20

Create a file list showing all outcome files whose extensions need to be changed from _mat.fq to .list

for f in *_mat.fq; 
do 
  title=$(echo $f | cut -d'_' -f2); 
  id=$(echo $f | cut -d'_' -f1); 
  echo $f > ${id}.list${title}; 
done

Single color graph for sample

../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 
--pe_list MS00018_S5.list1,MS00018_S5.list2 
--kmer_size 31 
--mem_height 17 
--mem_width 100 
--dump_binary MS00018_S5.ctx 
--sample_id MS00018_S5 
--remove_pcr_duplicates 
--quality_score_threshold 20 > MS00018_S5.log

Single color graph for reference

../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 
--se_list ref.filelist 
--kmer_size 31 
--mem_height 17 
--mem_width 100 
--dump_binary ref.ctx 
--sample_id ref 20 > ref.log

Run Error Cleaning for All Samples (reference is not included)

../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 
--mem_height 18
--mem_width 100
--kmer_size 31
--multicolour_bin N18_S15.ctx
--remove_low_coverage_supernodes 10
--dump_binary N18_S15.cleaned.ctx

Step 3

Pull the name of each .cleaned.ctx file to a cleaned.list file, then create a .filelist file for all cleaned.list files.

ls file1.cleaned.ctx > file1.cleaned.list
ls file2.cleaned.ctx > file2.cleaned.list
ls ref.ctx > ref.list
ls -1 *.list > ref-sample.filelist

Multicolour Graph

../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c3
--mem_height 20 
--mem_width 100 
--kmer_size 31 
--colour_list ref-sample.filelist 
--dump_binary ref-sample.ctx > ref-sample.log

Step 4

Variation Discovery Using The Bubble Caller

../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c3 
--mem_height 20 
--mem_width 100 
--kmer_size 31 
--multicolour_bin ref-sample.ctx 
--detect_bubbles1 -1/-1 
--ref_colour 2 
--output_bubbles1 bubbles-in-sample.out 
--print_colour_coverages 
--experiment_type EachColourAHaploidSampleExceptTheRefColour 
--genome_size 8000000 > bubbles-in-sample.log

Step 5

Reference Genome

(When in Cluster execute "module load stampy": doesn't work; path problem) (Run the following on wallace:)

stampy.py -G ref ref.fa
stampy.py -g ref -H ref

Turn Into VCF with reference

Make a sample list file (from bubble or multicolor log file): cat e1-bubbles-in-sample.log | grep CLEANED | cut -f2 > e1.sample.lis Customize the following command based on your output files, num of colors, index of ref colors, etc perl /home/weigang/CORTEX_release_v1.0.5.21/scripts/analyse_variants/process_calls-wq.pl --callfile e1-bubbles-in-sample.out --callfile_log e1-bubbles-in-sample.log -outvcf e1-bubbles-in-sample --outdir e1-vcfout --samplename_list e1.sample.list --num_cols 7 --stampy_bin /home/weigang/stampy-1.0.28/stampy.py --stampy_hash ref --refcol 6 --vcftools_dir /usr/local/bin --caller BC --kmer 31 --ploidy 1

Step 6. Parse VCF files

Filter out low-quality sites:

vcftools --vcf pat-5.decomp.vcf --keep-filtered PASS --recode --out pat-5

Extract coverage

vcftools --vcf pat-5.recode.vcf --extract-FORMAT-info COV (output file: out.COV.FORMAT)

Extract genotypes

vcftools --vcf pat-5.recode.vcf --extract-FORMAT-info GT (output file: out.GT.FORMAT)

Extract confidence

vcftools --vcf pat-5.recode.vcf --extract-FORMAT-info GT_CONF (output file: out.GT_CONF.FORMAT)

Send GT_CONF file to R and visualize log10(conf) distribution with boxplot
Use custom PERL file to filter out low-quality (e.g., GT_CONF < 30) genotype calls (flag with "?" or "na"), and make haplotype GT ("1/1" to "1", "0/0" to "0", "./." to "?)
Verify variants using IGV (see IGV protocol above)

Step 7. Variant Annotation & Visualization

vcf_parser.py out.decomp.vcf ref.gb (the code needs validation)
Data to database & web visualization, if necessary

hmmer

Annotate proteins with TIGRFAM

hmmsearch --tblout foo.hmmout   # table output for all sequences
          --domtblout foo.dmout # table output for all domains
          -E 0.01               # level of sequence significance
          --domE 0.01           # level of domain significance
          -o /dev/null          # don't show STDOUT 
          ../../TIGRFAMs-Release-15-Sep-17-2014/TIGRFAMs_15.0_HMM.LIB # HMM profile library for tiger fams 
          GCA_000583715.2_ASM58371v2_protein.faa &                    # input/query file in FASTA

PopGenome

library(PopGenome)
g = readVCF("pvt1.recode.vcf.gz", 1000, "8", 127890000, 128102000)
pops = split(sample[,1], sample[,2]) # create a list of populations
g = set.populations(g, pops, diploid = T) # set population names

# by windows
slide = sliding.window.transform(g, width = 100, jump = 20) # nsnps, not actual length
slide = F_ST.stats(slide, mode = "nucleotide")
snp.pos = slide@region.data@biallelic.sites # SNP positions
win.num = length(slide@region.names)
win.start = numeric()
for (i in 1:win.num) {win.start[i] = snp.pos[[i]][1]}
fst = slide@nuc.F_ST.vs.all
pop.names = names(slide@populations) # population names
plot(win.start, fst[,1], type ="n", las = 1, ylab = expression(F[st]), xlab = "SNP Position", ylim = c(0,0.4))
for (i in 1:length(slide@populations)) {
  lines(win.start, fst[,i], type = "l", col = pop.group[pop.names[i],4])
}
arch.coords=c(127982050, 127992931)
abline(v = arch.coords, col = "orange")
#rect(xleft = arch.coords[1], ybottom = -1, xright = arch.coords[2], ytop = 0.5, border = "transparent", col = 2)

Velvet

Cleaning

../../bbmap/bbduk.sh -Xmx1g 
           in1=WGC067462_hunhewenku_509_combined_R1.fastq # gz file works as well
           in2=WGC067462_hunhewenku_509_combined_R2.fastq # gz file works as well
           -out1=clean1.fq 
           -out2=clean2.fq 
           qtrim=rl
           trimq=20

Running Velvet Optimizer

srun ../../VelvetOptimiser-2.2.5/VelvetOptimiser.pl 
          --t 32
          --s 31 --e 31 --x 6 # kmer sizes
          -f '-shortPaired -fastq clean1.fq -shortPaired2 -fastq clean2.fq'
          -t 4 
          --optFuncKmer ‘n50’ 
          -p prefix

PacBio assembly with canu

./canu -p staph-auto-5 -d staph-auto-5 genomeSize=2.2m -pacbio-raw pac-reads-5.tar.gz

@@ Line 593: / Line 593: @@
 </syntaxhighlight>
 * Turn Into VCF with reference
-<syntaxhighlight lang="bash">
+Make a sample list file (from bubble or multicolor log file):
-perl ~/CORTEX_release_v1.0.5.21/scripts/analyse_variants/process_calls-wq.pl
+<code>cat e1-bubbles-in-sample.log | grep CLEANED | cut -f2 > e1.sample.lis</code>
---callfile bubbles-in-samples.out2
+Customize the following command based on your output files, num of colors, index of ref colors, etc
---callfile_log bubbles-in-samples.log
+<code>
---outvcf bubbles-in-samples
+perl /home/weigang/CORTEX_release_v1.0.5.21/scripts/analyse_variants/process_calls-wq.pl --callfile e1-bubbles-in-sample.out --callfile_log e1-bubbles-in-sample.log -outvcf e1-bubbles-in-sample --outdir e1-vcfout --samplename_list e1.sample.list --num_cols 7 --stampy_bin /home/weigang/stampy-1.0.28/stampy.py --stampy_hash ref --refcol 6 --vcftools_dir /usr/local/bin --caller BC --kmer 31 --ploidy 1
---outdir vcfout
+</code>
---samplename_list sample.list
---num_cols 17
---stampy_bin ~/stampy-1.0.28/stampy.py
---stampy_hash ref
---refcol 2
---vcftools_dir /usr/local/bin
---caller BC
---kmer 31
---ploidy 1
-</syntaxhighlight>
 ==Step 6. Parse VCF files==

Mini-Tutorals: Difference between revisions

Revision as of 19:11, 25 July 2017

An R function to transform a wide data frame to a long one=

Variant verification using IGV

KRAKEN

PATRIC microbial genome annotation CLI

Parsimony reconstruction using MPR

ospC amplicon identification

Running a Screen Session

use byobu

use screen

Bp-utils: sequence, alignment & tree utilities by Qiu Lab

bioseq: sequence/FASTA manipulations

bioaln: alignment/CLUSTALW manipulations

biotree: tree/NEWICK manipulations

biopop: SNP statistics

Homology searching and clustering

BLAST+: search("google") for homologs/pariwise alignment

hmmer

cdhit

interproscan

Programs for producing multiple alignments

MUSCLE

MUGSY

CLUSTALW

MAFT

TCOFFEE

Programs for producing phylogeny & phylogenetic analysis

FastTree

PHYLIP

MrBayes

RaXML

PhyloNet

R packages for phylogenetics

APE

phengorn

phytools

Population genetics

ms: coalescence simulation

SFS: forward simulation

PAML: testing selection with Ka/Ks

Microbial genome databases & pipelines in Qiu Lab

borreliabase

pa2

spiro_genomes/treponema

Blast a set of genes against a bacteria genome

de novo variant call with cortex_var

Variant call with cortex_var

Example

Step 1

Step 2. Clean Fastq Files & Run Single-color Graph & Error Cleaning

Step 3

Step 4

Step 5

Step 6. Parse VCF files

Step 7. Variant Annotation & Visualization

hmmer

Annotate proteins with TIGRFAM

PopGenome

Velvet

Cleaning

Running Velvet Optimizer

PacBio assembly with canu

Navigation menu

Search