Revision as of 20:55, 25 January 2016

Bp-utils: sequence, alignment & tree utilities by Qiu Lab

bioseq: sequence/FASTA manipulations

Use accession "CP002316.1" to retrieve the Genbank file from NCBI. Save the output (in genbank format) to a file named as "cp002316.gb".

bioseq -f "CP002316.1" -o'genbank' > cp002316.gb

Use the above file as input, extract FASTA sequences for each genes and save the output to a new file called "cp002316.nuc". Use this file for the following questions.

bioseq -i "genbank" -F cp002316.gb > cp002316.fas

Count the number of sequences.

bioseq -n cp002316.fas

In a single command, pick the first 10 sequences and find their length

bioseq -p "order:1-10" cp002316.fas | bioseq –l

In a single command, pick the third and seventh sequences from the file and do the 3-frame translation. Which reading frame is the correct on both? Specify

bioseq -p "order:3,7" cp002316.fas | bioseq -t3

Find the base composition of the last two sequences

bioseq -p "order:25-26" cp002316.fas| bioseq –c

Pick the sequence with id "Bbu|D1_B11|8784|9302|1" and count the number of codons present in this sequence

bioseq -p "id:BbuJD1_B11|8784|9302|1" cp002316.fas | bioseq –C

Delete the last 10 sequences from the file and save the output to cp002316-v2.nuc

bioseq -d "order:17-26" cp002316.fas > cp002316-v2.nuc

In a single command, pick the first sequence, then get the 50-110 nucleotides and make reverse complement of the sub-sequences

bioseq -p "order:1" cp002316.fas | bioseq -s "50,110" | bioseq –r

In a single command, get the first 100 nucleotides of all the sequences present in the file and do 1-frame translation of all sub-sequences.

bioseq -s "1,100" cp002316.fas | bioseq -t1

bioaln: alignment/CLUSTALW manipulations

Go to /home/shared/LabMeetingReadings/Test-Data and find the sequence alignment file “bioaln_tutorial.aln”. Name the format of the alignment file. Use it to answer all the questions below.

CLUSTALW

Find the length of the alignment.

bioaln -l bioaln_tutorial.aln

Count the number of the sequences present in the alignment.

bioaln -n bioaln_tutorial.aln

How do you convert this alignment in phylip format? Save the output.

bioaln -o "phylip" bioaln_tutorial.aln > test.phy

Pick “seq2, seq5, seq7, seq10” from the alignment and calculate their average percent identity.

bioaln -p "seq2, seq5, seq7, seq10" bioaln_tutorial.aln | bioaln -a

Get an alignment slice from “50-140” and find the average identities of the slice for sliding windows of 25.

bioaln -s "50, 140" bioaln_tutorial.aln | bioaln -w "25"

Extract conserved blocks from the alignment.

bioaln -B bioaln_tutorial.aln

Find the unique sequences and list their ids.

bioaln -u bioaln_tutorial.aln | bioaln -L

Extract third sites from the alignment and show only variable sites in match view.

bioaln -T bioaln_tutorial.aln | bioaln -v | bioaln -m

Remove the gaps and show the final alignment in codon view for an alignment slice “1-100”.

 bioaln -s "1, 100" bioaln_tutorial.aln | bioaln -g | bioaln -c

Add a 90% consensus sequence and then show the final alignment in match plus codon view for an alignment slice “20-80”. (Hint: match view followed by codon view)

bioaln -s "20, 80" bioaln_tutorial.aln | bioaln -C "90" | bioaln -m | bioaln -c

biotree: tree/NEWICK manipulations

biopop: SNP statistics

Homology searching and clustering

BLAST+: search("google") for homologs/pariwise alignment

hmmer

cdhit

cdhit -i all.pep -o all.cdhit -c 0.5 -n 3

Options:

-i: input file
-o: output file
-c: percent identity (below which it is considered different families)
-n: word length

interproscan

../../software/interproscan/interproscan-5.13-52.0/interproscan.sh -i trep-cdhit.representatives.pep2 -o  trep-representatives.tsv -t p -goterms -pa -f tsv

Documentation page: How to run

Programs for producing multiple alignments

MUSCLE

CLUSTALW

MAFT

TCOFFEE

Programs for producing phylogeny & phylogenetic analysis

FastTree

PHYLIP

MrBayes

RaXML

PhyloNet

R packages for phylogenetics

APE

phengorn

phytools

Population genetics

ms: coalescence simulation

SFS: forward simulation

PAML: testing selection with Ka/Ks

Microbial genome databases & pipelines in Qiu Lab

borreliabase

pa2

spiro_genomes/treponema

Genome annotation pipeline

de novo variant call with cortex_var

Create binary file of fasta genome file.

Run contex_var_31_c1 (cutoff 1 used for 1 genome)

--se_list is the command the reads the list you want to target (ie: list-genome.txt)
--kmer_size is the middle size, has to be an odd integer
--mem_width always choose 17
--mem_height always choose 100
--dump_binary Name your file name (ie: Genome.ctx)

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 --se_list list-Evo.txt --kmer_size 31 --mem_width 17 --mem_height 100 --dump_binary Evo.ctx > Evo.log

Read each binary file (.ctx) into its own individual color list (ls Evo.ctx > Evo.colorlist) Then save these lists into their own collective colorlist.txt (ls *.ctx > colorlist.txt)

Reveal genetic variation using the Bubble Caller from cortex_var.

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --se_list colorlist.txt --kmer_size 31 --mem_width 17 --mem_height 100 --dump_binary all-colors.ctx > all-colors.log

Bubble caller will detect differences between each genome by assigning distinct colors to each genome (note that the UK spelling of color is used: colour)

--multicolour_bin holds your all-colors.ctx binary from the Bubble Caller
--detect_bubbles1 i/i Detects 1 variation between genomes i and i. i indicates the position number the genome is listed on the colorlist.txt file. If the genome is fourth on the colorlist.txt, for example, its corresponding i variable is 4
--output_bubbles1 Output variant reads in fasta format (ie: Evo-RefHG.var for bubble detection between

Evolved genome and Reference HG genome)

--print_colour_coverages necessary for output

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --kmer_size 31  --mem_height 17 --mem_width 100 --multicolour_bin all-colors.ctx --detect_bubbles1 0/1 --output_bubbles1 Evo-RefHG.var --print_colour_coverages  > Evo-RefHG.log

Variant call with cortex_var

Example

Files

MS00018_S5_L001_R1_001.fastq
MS00018_S5_L001_R2_001.fastq
KU-090401_S3_L001_R1_001.fastq
KU-090401_S3_L001_R2_001.fastq
...

File content

@M03268:52:000000000-AJFAY:1:1101:16970:1555 1:N:0:7
CCCATGAACGGCACGTTCACGATGCAGAAGGTGGTGACCAGGCCGGTGTCGGCGACCTCGACGTAATCGTCGGTGGGGGA
GCCGTCGCGCGGGTCCGCGCCGCGCGGCGGGAAGTACACCTTGCCGTCCATGCGGGCGCGGGCGCCCAGCAGACGACCCT
CCATGAGCGCATTCAGGAATTCGGCCTCCTGCGGCGCGGCGGTGTGCTTGATATTGAAGTCGACCGGGGTCACGATGCCG
GTGACCGGTT
+
11>1>111@11>1A1FGF1FC0F0A111010GB0ECBFGF0/AFECGGFHECE??EEGHE/EEEHEFHEHHGCEECE??/
<CFCGGCCGCCCCCCHCGGCCCG@G??@@?@-@BFFFFFFFFFFF;B@FBFBFF<?@;@@-=?@??@-EFFFFF@;@@@F
FFBF/BBF;@@FFFFFFFFFFFFF@FFFFFFF<@@@@@@@@@@@FBFFFFFFFFFFFFFFFFF@@@?@?@FFFFFFFBF@
?@@EFF@@<B
@M03268:52:000000000-AJFAY:1:1101:16136:1618 1:N:0:7
TCCTGGCCCGTGAAACCGCTTGCCCGGTACAGGTTCTGGACTACCGCCTGGCACCCGAGCATCCGTTCCCGGCGGCGCTC
GACGACGCCGAGGCGGCGTTCCTGGAACTGGTGGCCGCCGGCTACCGGCCCGAACGCATTGCGGTCGCGGGTGATTCGGCCGGTGGCGGGCTCTCGCTCGCGCTGGCCGAACGGCTGCGCGACCGGCACGGGCTGGTTCCGGCCGCGCTCGGGCTGATCGCGCCCTGGGC
+
11>>11C1A@A?11BDF?EEGAFGFG?ECCHBHHGFG1EGHFHHHCCGGC/GCGGGCEECEFGHGGFFGHGGGGCECCC<CCC@CCCC@CGCCCGGC?:@EBFBF/CFFFF0/CFG?=@=-9>AFF@=@@?@;@@@F--9:BF@A@@?@--;9-BFFFF@A-@99B?-9@?=EFFBAAE;9>?;@@BF@9-@-;@=-E@@@=@@;@?>@9@<?@?BBBFFF;?>;;-@@?@<?@?9-FBFF@-99-9E-9
...

Step 1

Create matched FASTQ files (python script)

Step 2

Create a file list showing all outcome files whose extensions need to be changed from _mat.fq to .list

for f in *_mat.fq; 
do 
  title=$(echo $f | cut -d'_' -f2); 
  id=$(echo $f | cut -d'_' -f1); 
  echo $f > ${id}.list${title}; 
done

Step 3

Error cleaning(python script)

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --mem_height 18 --mem_width 100 --kmer_size 31 --multicolour_bin N18_S15.ctx --remove_low_coverage_supernodes 10 --dump_binary MS00018_S5.cleaned.ctx

Step 4

Compile a new binary for two colors

make NUM_COLS=2 MAXK=31 cortex_var

Step 5

Create the ref-samples.filelist

echo ref.ctx > ref_binary.filelist
ls -1 *.cleaned.ctx > sample_binary.filelist
ls -1 *binary.filelist > ref-samples.filelist

Multicolour Graph

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --mem_height 20 --mem_width 100 --kmer_size 31 --colour_list ref-samples.filelist --dump_binary ref-samples.ctx

Step 6

Variation Discovery Using The Bubble Caller

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --mem_height 20 --mem_width 100 --kmer_size 31 --multicolour_bin ref-samples.ctx --detect_bubbles1 -1/-1 --exclude_ref_bubbles --ref_colour 0 --output_bubbles1 bubbles-in-samples.out2 --print_colour_coverages --experiment_type EachColourAHaploidSampleExceptTheRefColour --genome_size 8000000 > bubbles-in-samples.log

Step 7

Reference Genome

module load stampy
stampy.py -G N-2927-WGS N-2927-WGS/GCA_000583715.2_ASM58371v2_genomic.fna
stampy.py -g N-2927-WGS -H N-2927-WGS

Turn Into VCF

perl ~/CORTEX_release_v1.0.5.21/scripts/analyse_variants/process_calls-wq.pl --callfile bubbles-in-samples.out2 --callfile_log bubbles-in-samples.log --outvcf bubbles-in-samples --outdir vcfout --samplename_list sample.list --num_cols 17 --stampy_bin ~/stampy-1.0.28/stampy.py --vcftools_dir /usr/local/bin --caller BC --kmer 31 --ploidy 1

@@ Line 253: / Line 253: @@
 # Create a file list showing all outcome files whose extensions need to be changed from _mat.fq to .list
 <syntaxhighlight lang="bash">
 for f in *_mat.fq;
 do
@@ Line 260: / Line 259: @@
    echo $f > ${id}.list${title};
 done
+</syntaxhighlight>
+==Step 3==
+# Error cleaning(python script)
+<syntaxhighlight>
+/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --mem_height 18 --mem_width 100 --kmer_size 31 --multicolour_bin N18_S15.ctx --remove_low_coverage_supernodes 10 --dump_binary MS00018_S5.cleaned.ctx
+</syntaxhighlight>
+==Step 4==
+# Compile a new binary for two colors
+<syntaxhighlight>
+make NUM_COLS=2 MAXK=31 cortex_var
+</syntaxhighlight>
+==Step 5==
+# Create the ref-samples.filelist
+<syntaxhighlight>
+echo ref.ctx > ref_binary.filelist
+ls -1 *.cleaned.ctx > sample_binary.filelist
+ls -1 *binary.filelist > ref-samples.filelist
+</syntaxhighlight>
+# Multicolour Graph
+<syntaxhighlight>
+/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --mem_height 20 --mem_width 100 --kmer_size 31 --colour_list ref-samples.filelist --dump_binary ref-samples.ctx
+</syntaxhighlight>
+==Step 6==
+# Variation Discovery Using The Bubble Caller
+<syntaxhighlight>
+/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --mem_height 20 --mem_width 100 --kmer_size 31 --multicolour_bin ref-samples.ctx --detect_bubbles1 -1/-1 --exclude_ref_bubbles --ref_colour 0 --output_bubbles1 bubbles-in-samples.out2 --print_colour_coverages --experiment_type EachColourAHaploidSampleExceptTheRefColour --genome_size 8000000 > bubbles-in-samples.log
 </syntaxhighlight>
-==Step 2==
+==Step 7==
-# Generate single-color graphs
+# Reference Genome
-<syntaxhighlight lang="bash">
+<syntaxhighlight>
-/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --se_list colorlist.txt --kmer_size 31 --mem_width 17 --mem_height 100 --dump_binary all-colors.ctx > all-colors.log
+module load stampy
+stampy.py -G N-2927-WGS N-2927-WGS/GCA_000583715.2_ASM58371v2_genomic.fna
+stampy.py -g N-2927-WGS -H N-2927-WGS
+</syntaxhighlight>
+# Turn Into VCF
+<syntaxhighlight>
+perl ~/CORTEX_release_v1.0.5.21/scripts/analyse_variants/process_calls-wq.pl --callfile bubbles-in-samples.out2 --callfile_log bubbles-in-samples.log --outvcf bubbles-in-samples --outdir vcfout --samplename_list sample.list --num_cols 17 --stampy_bin ~/stampy-1.0.28/stampy.py --vcftools_dir /usr/local/bin --caller BC --kmer 31 --ploidy 1
 </syntaxhighlight>
-==Step 3==

Mini-Tutorals: Difference between revisions