Mini-Tutorals and Southwest-University: Difference between pages

From QiuLab
imported>Saymon
imported>Weigang
==bp-utils: bioseq==
<center>'''Biomedical Genomics'''</center>
* Use accession "CP002316.1" to retrieve the GenBank file from NCBI. Save the output (in GenBank format) to a file named "cp002316.gb".
<center>July 8-19, 2019</center>
<div class="toccolours mw-collapsible">
<center>'''Instructor:''' Weigang Qiu, Ph.D.<br>Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center<br>Adjunct Faculty, Department of Physiology and Biophysics,
<syntaxhighlight lang="bash">
Institute for Computational Biomedicine, Weill Cornell Medical College</center>
bioseq -f "CP002316.1" -o'genbank' > cp002316.gb
<center>'''Office:''' B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA</center>
</syntaxhighlight>
<center>'''Email:''' weigang@genectr.hunter.cuny.edu</center>
</div>
<center>'''Lab Website:''' http://diverge.hunter.cuny.edu/labwiki/</center>
* Using the above file as input, extract a FASTA sequence for each gene and save the output to a new file called "cp002316.fas". Use this file for the following questions.
<br>
<div class="toccolours mw-collapsible">
<center>'''Host''': Shunqin Zhu (祝顺琴), Ph.D.<br>Associate Professor, School of  Life Science, South West University</center>
<syntaxhighlight lang="bash">
----
bioseq -i "genbank" -F cp002316.gb > cp002316.fas
[[File:Lp54-gain-loss.png|300px|thumbnail|Figure 1. Gains & losses of host-defense genes among Lyme pathogen genomes ([https://www.ncbi.nlm.nih.gov/pubmed/24704760 Qiu & Martin 2014])]]
</syntaxhighlight>
==Course Overview==
</div>
Welcome to BioMedical Genomics, a computer workshop for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and next-generation DNA-sequencing technologies, biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field.
* Count the number of sequences.  
 
<div class="toccolours mw-collapsible">
Genome information is revolutionizing virtually all aspects of the life sciences, including basic research, medicine, and agriculture. At the same time, using genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, and data analysis.
<syntaxhighlight lang="bash">
 
bioseq -n cp002316.fas
This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises, using published studies.
</syntaxhighlight>
 
</div>
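The sequence count from the exercise above can be cross-checked with standard shell tools, since every FASTA record begins with a ">" header line. The following is a minimal sketch, not part of bp-utils: the function name <code>count_fasta_records</code> and the demo records are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper (not a bp-utils command): count FASTA records
# by counting header lines, which always start with '>'.
count_fasta_records() {
  # Read from the file given as $1, or from standard input if none is given.
  grep -c '^>' "${1:-/dev/stdin}"
}

# Tiny made-up input so the sketch runs without the GenBank download.
demo=$(mktemp)
printf '>gene1\nATGAAATAA\n>gene2\nATGCCCTAG\n' > "$demo"
count_fasta_records "$demo"   # prints 2
rm -f "$demo"
```

Running <code>count_fasta_records cp002316.fas</code> should report the same number as the bioseq command, assuming the file is plain FASTA.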
The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.
* In a single command, pick the first 10 sequences and find their length
 
<div class="toccolours mw-collapsible">
==Learning goals==
<syntaxhighlight lang="bash">
By the end of this course successful students will be able to:
bioseq -p "order:1-10" cp002316.fas | bioseq -l
* Describe next-generation sequencing (NGS) technologies & contrast them with traditional Sanger sequencing
</syntaxhighlight>
* Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
</div>
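As a sanity check on the reported lengths, the length of each record can also be computed with awk. This is a sketch under the assumption that the input is plain FASTA with Unix line endings; the function name <code>fasta_lengths</code> and the demo records are hypothetical and not part of bp-utils.

```shell
# Hypothetical helper (not a bp-utils command): print "id<TAB>length"
# for each FASTA record, summing over wrapped sequence lines.
fasta_lengths() {
  awk '
    /^>/ {                                  # header line starts a new record
      if (id != "") print id "\t" len       # flush the previous record
      id = substr($1, 2)                    # first word, minus the ">"
      len = 0
      next
    }
    { len += length($0) }                   # sequence line: add its length
    END { if (id != "") print id "\t" len } # flush the last record
  ' "${1:-/dev/stdin}"
}

# Made-up records; gene1 is wrapped across two lines.
printf '>gene1 demo\nATGAAA\nTAA\n>gene2\nATGCCCTAGGG\n' | fasta_lengths
# prints:
# gene1	9
# gene2	11
```

Piping the result through <code>head -n 10</code> restricts the output to the first 10 records.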
* Visualize and explore genomics data using RStudio
* In a single command, pick the third and seventh sequences from the file and do the 3-frame translation. Which reading frame is correct for each sequence? Specify.
* Replicate key results using a raw data set produced by a primary research paper
<div class="toccolours mw-collapsible">
 
<syntaxhighlight lang="bash">
==Web Links==
bioseq -p "order:3,7" cp002316.fas | bioseq -t3
* Install R base: https://cloud.r-project.org
</syntaxhighlight>
* Install RStudio (Desktop version): http://www.rstudio.com/download
</div>
* Download: [http://www.r4all.org/books/datasets R datasets]
* Find the base composition of the last two sequences
* A reference book: [https://r4ds.had.co.nz/ R for Data Science (Wickham & Grolemund)]
<div class="toccolours mw-collapsible">
 
<syntaxhighlight lang="bash">
==Quizzes and Exams==
bioseq -p "order:25-26" cp002316.fas | bioseq -c
Student performance will be evaluated by attendance, five assignments, two quizzes, a mid-term exam, and a final presentation:
</syntaxhighlight>
* Attendance: 50 pts
</div>
* Assignments: 5 x 10 = 50 pts
* Pick the sequence with id "BbuJD1_B11|8784|9302|1" and count the number of codons present in this sequence
* Quizzes: 2 x 25 pts = 50 pts
<div class="toccolours mw-collapsible">
* Mid-term: 50 pts
<syntaxhighlight lang="bash">
* Final presentation: 50 pts
bioseq -p "id:BbuJD1_B11|8784|9302|1" cp002316.fas | bioseq -C
Total: 250 pts
</syntaxhighlight>
 
</div>
==Course Schedule==
* Delete the last 10 sequences from the file and save the output to cp002316-v2.nuc
{| class="wikitable"
<div class="toccolours mw-collapsible">
|-
<syntaxhighlight lang="bash">
! Date & Hour !! Tutorials !! Assignment !! Quiz & Exam
bioseq -d "order:17-26" cp002316.fas > cp002316-v2.nuc
|-
</syntaxhighlight>
| July 8 (Mon), 8:40-12:10 || Introduction; R Tutorial I;
</div>
[[File:R-part-1-small.pdf|thumbnail|Lecture slides]]
* In a single command, pick the first sequence, extract nucleotides 50-110, and reverse-complement the sub-sequence
||
<div class="toccolours mw-collapsible">
Assignment #1 (create a Word document including scripts & graphs)
<syntaxhighlight lang="bash">
* Install R/RStudio and the "tidyverse" package on your own computer
bioseq -p "order:1" cp002316.fas | bioseq -s "50,110" | bioseq -r
* Recreate Script 1 & Mini-Practical
</syntaxhighlight>
* Show help page for function "seq"
</div>
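To see what the sub-sequence and reverse-complement steps do, the same operations can be tried on a bare sequence string with standard tools. This is only an illustrative sketch: <code>revcom</code> is a hypothetical helper name, the input is a made-up six-base string rather than a FASTA record, and <code>cut -c</code> uses 1-based, inclusive positions.

```shell
# Hypothetical helper (not a bp-utils command): reverse-complement a bare
# DNA string read from standard input (no FASTA header, one sequence per line).
revcom() {
  # rev reverses each line; tr swaps every base for its complement.
  rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Take bases 2-4 of a made-up sequence, then reverse-complement them.
printf 'AAGCTT\n' | cut -c2-4 | revcom   # AGC -> prints GCT
```

For the exercise itself, the coordinates would be <code>cut -c50-110</code>, applied to the sequence with its header line and line wrapping removed.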
* Download dataset
* In a single command, get the first 100 nucleotides of all the sequences present in the file and do 1-frame translation of all sub-sequences.
** Create a new folder (e.g., Desktop/rtutor)
<div class="toccolours mw-collapsible">
** Create a sub-folder (e.g., Desktop/rtutor/data/)
<syntaxhighlight lang="bash">
** Download from http://www.r4all.org/the-book/datasets
bioseq -s "1,100" cp002316.fas | bioseq -t1
** Save to the sub-folder
</syntaxhighlight>
** Unzip the file
</div>
 
  ||
|-
| July 9 (Tu), 8:40-12:10 || NGS; R Tutorial II ||  
Assignment #2
* List pros & cons of Sanger vs NGS
* Compare accuracy, read length, and error rate between Illumina and PacBio
* Describe sequence information captured with each of the following file formats: FASTA, FASTQ, SAM, VCF
* Wide vs Tall data frames
* Variable names (informative, case sensitive)
* Read file
||
|-
| July 10 (Wed), 8:40-12:10 || Microbiome I; R Tutorial III ||
Assignment #3
|| Quiz I
|-
| July 11 (Thur), 8:40-12:10 || Microbiome II; R Tutorial IV ||  
Assignment #4
||
|-
| July 12 (Fri), 8:40-12:10 ||  || || Mid-term Exam
|-
| Weekend || Break
|-
| July 15 (Mon), 8:00-12:10 || Transcriptome; R Tutorial V ||  
Assignment #5
||
|-
| July 16 (Tu), 8:00-12:10 || Proteome ||
||
|-
| July 17 (Wed), 8:00-12:10 || Genomics I ||
|| Quiz II
|-
| July 18 (Thur), 8:00-12:10 || Genomics II  || ||
|-
| July 19 (Fri), 8:00-12:10|| Presentations
|}
 
==Papers & Datasets==
{| class="wikitable sortable"
|-
! Omics Application !! Paper link !! Data set !! NGS Technology
|-
| Microbiome || [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0193652 Rimoldi_etal_2018_PlosOne] || [https://doi.org/10.1371/journal.pone.0193652.s004 S1 Dataset] || 16S rDNA amplicon sequencing
|-
| Transcriptome || [https://science.sciencemag.org/content/350/6264/1096 Wang_etal_2015_Science] || Tables S2 & S4 || RNA-Seq
|-
| Transcriptome & Regulome || [https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-019-0477-8 Nava_etal_2019_BMCGenomics] || Tables S2 & S3 || RNA-Seq & ChIP-Seq
|-
| Proteome || [https://www.ncbi.nlm.nih.gov/pubmed/28232952 Qiu_etal_2017_NPJ] || (to be posted) || SILAC
|-
| Population genomics (Lyme) || [https://jcm.asm.org/content/56/11/e00940-18.long Di_etal_2018_JCM] || [https://github.com/weigangq/ocseq Data & R codes] || Amplicon sequencing (antigen locus)
|-
| Population genomics/GWAS (Human) || [https://science.sciencemag.org/content/351/6274/737.long Simonti_etal_2016_Science] || [https://science.sciencemag.org/highwire/filestream/673591/field_highwire_adjunct_files/1/aad2149-Simonti-SM.Table.S2.xlsx Table S2] || Whole-genome sequencing (WGS); [http://www.internationalgenome.org/ 1000 Genomes Project (IGSR)]
|-
| TB surveillance || [https://jcm.asm.org/content/53/7/2230 Brown_etal_2015] || [https://www.ebi.ac.uk/ena/data/view/PRJEB9206 Sequence Archives] || Whole-genome sequencing (WGS)
|}

Revision as of 06:05, 8 July 2019

Biomedical Genomics
July 8-19, 2019
Instructor: Weigang Qiu, Ph.D.
Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center
Adjunct Faculty, Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medical College
Office: B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA
Email: weigang@genectr.hunter.cuny.edu
Lab Website: http://diverge.hunter.cuny.edu/labwiki/


Host: Shunqin Zhu (祝顺琴), Ph.D.
Associate Professor, School of Life Science, South West University

Figure 1. Gains & losses of host-defense genes among Lyme pathogen genomes (Qiu & Martin 2014)

Course Overview

Welcome to BioMedical Genomics, a computer workshop for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and next-generation DNA-sequencing technologies, biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field.

Genome information is revolutionizing virtually all aspects of the life sciences, including basic research, medicine, and agriculture. At the same time, using genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, and data analysis.

This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises, using published studies.

The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.

Learning goals

By the end of this course successful students will be able to:

  • Describe next-generation sequencing (NGS) technologies & contrast them with traditional Sanger sequencing
  • Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
  • Visualize and explore genomics data using RStudio
  • Replicate key results using a raw data set produced by a primary research paper

Web Links

Quizzes and Exams

Student performance will be evaluated by attendance, five assignments, two quizzes, a mid-term exam, and a final presentation:

  • Attendance: 50 pts
  • Assignments: 5 x 10 = 50 pts
  • Quizzes: 2 x 25 pts = 50 pts
  • Mid-term: 50 pts
  • Final presentation: 50 pts

Total: 250 pts

Course Schedule

Date & Hour Tutorials Assignment Quiz & Exam
July 8 (Mon), 8:40-12:10 Introduction; R Tutorial I;

Assignment #1 (create a Word document including scripts & graphs)

  • Install R/RStudio and the "tidyverse" package on your own computer
  • Recreate Script 1 & Mini-Practical
  • Show help page for function "seq"
  • Download dataset
July 9 (Tu), 8:40-12:10 NGS; R Tutorial II

Assignment #2

  • List pros & cons of Sanger vs NGS
  • Compare accuracy, read length, and error rate between Illumina and PacBio
  • Describe sequence information captured with each of the following file formats: FASTA, FASTQ, SAM, VCF
  • Wide vs Tall data frames
  • Variable names (informative, case sensitive)
  • Read file
July 10 (Wed), 8:40-12:10 Microbiome I; R Tutorial III

Assignment #3

Quiz I
July 11 (Thur), 8:40-12:10 Microbiome II; R Tutorial IV

Assignment #4

July 12 (Fri), 8:40-12:10 Mid-term Exam
Weekend Break
July 15 (Mon), 8:00-12:10 Transcriptome; R Tutorial V

Assignment #5

July 16 (Tu), 8:00-12:10 Proteome
July 17 (Wed), 8:00-12:10 Genomics I Quiz II
July 18 (Thur), 8:00-12:10 Genomics II
July 19 (Fri), 8:00-12:10 Presentations

Papers & Datasets

Omics Application Paper link Data set NGS Technology
Microbiome Rimoldi_etal_2018_PlosOne S1 Dataset 16S rDNA amplicon sequencing
Transcriptome Wang_etal_2015_Science Tables S2 & S4 RNA-Seq
Transcriptome & Regulome Nava_etal_2019_BMCGenomics Tables S2 & S3 RNA-Seq & ChIP-Seq
Proteome Qiu_etal_2017_NPJ (to be posted) SILAC
Population genomics (Lyme) Di_etal_2018_JCM Data & R codes Amplicon sequencing (antigen locus)
Population genomics/GWAS (Human) Simonti_etal_2016_Science Table S2 Whole-genome sequencing (WGS); 1000 Genomes Project (IGSR)
TB surveillance Brown_etal_2015 Sequence Archives Whole-genome sequencing (WGS)