Biol375 2019 and BioMed-R: Difference between pages

From QiuLab
(Difference between pages)
imported>Lab
m (change package spelling to phangorn)
 
imported>Weigang
 
<center>'''Molecular Evolution''' (BIOL 375.00/790.64/793.03, Fall 2019)</center>
<center>'''BIOL47120 Biomedical Genomics II'''</center>
<center>'''Instructor:''' Dr Weigang Qiu, Professor, Department of Biological Sciences </center>
<center>Spring 2019</center>
<center>'''Room:''' 926 HN (Seminar Room, North Building)</center>
<center>'''Instructor:''' Weigang Qiu, Ph.D.<br>Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center<br>Adjunct Faculty, Department of Physiology and Biophysics,
<center>'''Hours:''' Mon. & Thur 4:10-5:25 pm</center>
Institute for Computational Biomedicine, Weill Cornell Medical College</center>
<center>'''Office Hours:''' Belfer Research Building ([https://www.google.com/maps/place/413+E+69th+St,+New+York,+NY+10021/@40.7655886,-73.9561743,17z/data=!3m1!4b1!4m2!3m1!1s0x89c258c3d235f76f:0x4f3d0d5d8a78fe6?hl=en Google Map]) BB-402; Fridays 3-5pm or by appointment</center>
<center>'''Office:''' B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA</center>
<center>'''Course Website:''' http://diverge.hunter.cuny.edu/labwiki/Biol375_2019</center>
<center>'''Email:''' weigang@genectr.hunter.cuny.edu</center>
<center>christopher.panlasigui47@myhunter.cuny.edu</center>
<center>'''Lab Website:''' http://diverge.hunter.cuny.edu/labwiki/</center>
----
<center>
[[File:Borreliabase-screenshot-1.png|350px|thumbnail]]
{| class="wikitable"
==Course Description==
Molecular evolution is the study of the change of DNA and protein sequences through time. Theories and techniques of molecular evolution are widely used in species classification, biodiversity, comparative genomics, and molecular epidemiology. Contents of the course include:
* Population genetics, which is a theoretical framework for understanding mechanisms of sequence evolution through mutation, recombination, gene duplication, genetic drift, and natural selection.
* Molecular systematics, which introduces statistical models of sequence evolution and methods for reconstructing species phylogeny.
* Bioinformatics, which  provides hands-on training on data acquisition and the use of software tools for phylogenetic analyses.
 
This 3-credit course is designed for upper-level biology-major undergraduates.  Hunter pre-requisites are BIOL203, and MATH150 or STAT113.
 
==Textbooks==
* ('''Required''') Graur, 2016, Molecular and Genome Evolution, First Edition, Sinauer Associates, Inc. ISBN: 978-1-60535-469-9. [http://www.sinauer.com/molecular-and-genome-evolution.html Publisher's Website] (Student discount: 15% off and free UPS standard shipping)
* (''Recommended'') Baum & Smith, 2013. Tree Thinking: an Introduction to Phylogenetic Biology, Roberts & Company Publishers, Inc.
 
==Learning Goals==
* Be able to describe evolutionary relationships using phylogenetic trees
* Be able to use web-based as well as stand-alone software to infer phylogenetic trees
* Understand mechanisms of DNA sequence evolution
* Understand algorithms for building phylogenetic trees
 
==Links for phylogenetic tools==
* [http://www.ncbi.nlm.nih.gov/ NCBI sequence databases]
* R Tools
** R source: download & install from [https://mirrors.nics.utk.edu/cran/ a mirror site]
** R Studio: [https://www.rstudio.com/ download & install]
** APE package
** phangorn package
* [http://phylogeny.fr/ A Molecular Phylogeny Web Server]
* [http://www.evolgenius.info/evolview/ EvolView: an online tree viewer]
 
==Exams & Grading==
* Bonus for full attendance & active participation in classroom discussions.
* Assignments.  All assignments should be handed in as hard copies only. Email submissions will not be accepted. Late submissions will receive a 10% deduction (of the total grade) per day.
* Three Mid-term Exams (30 pts each)
* Comprehensive Final Exam (50 pts)
 
==Academic Honesty==
While students may work in groups and help each other for assignments, duplicated answers in assignments will be flagged and investigated as possible acts of academic dishonesty. To avoid being investigated as such, <font color="red">do NOT copy anyone else's work, or let others copy your work</font>. At the least, rephrase using your own words. Note that the same rule applies regarding the use of textbook and online resources: copied sentences are not acceptable and will be considered plagiarism.
 
Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity and will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures.
 
==Course Schedule==
===Part 1. Tree Thinking===
* 8/29 (TH). Overview & Introduction. Textbook Chapter: "Introduction" (pages 1-3)
{| class="wikitable sortable mw-collapsible"
! Assignment 1 (10 pts; Due next class 9/5)
|-
|-
|
! MA plot !! Volcano plot !! Heat map
* (10 pts) Pre-test: Full credits will be given as long as each question is answered with some reasoning. In other words, it will NOT be graded on being right or wrong. It's an assessment tool, to be compared with later test outcomes to show teaching/learning results. [[File:Pretest.pdf|thumbnail]]
|}
* 9/5 (TH). Introduction (Continued)
** R terminologies
*** Object: variable that contains data (e.g., "iris")
*** Object class: type of data (e.g., "data.frame", which is a table)
*** Function: e.g., data(iris), which loads the data set called "iris"
*** Function arguments: input and options (e.g., "iris" above)
** Tutorial: R & R-Studio <font color="red">(Bring your own computer)</font>
** Lecture slides: [[File:Intro-2019.pdf|thumbnail]]
{| class="wikitable sortable mw-collapsible"
! Assignment 2 (5 pts; Due: next session)
|-
|-
| R exercises
| [[File:GeneExp1.jpeg|300px|thumbnail| fold change (y-axis) vs. total expression levels (x-axis)]] ||
# Install R & R-studio (see "Links for phylogenetic tools" above)
[[File:GeneExp2.jpeg|300px|thumbnail| p-value (y-axis) vs. fold change (x-axis)]]
# Open R-studio and install the "ape" package using the "Packages"->"Install" menu, located within the lower right window
||
# Type the following commands in the console window (lower left), one at a time. Wait for the prompt ">" to appear before proceeding to the next command; quit & restart R-studio if stuck:
[[File:GeneExp3.jpeg|300px|thumbnail| genes significantly down or up-regulated (at p<1e-4)]]
## library(ape)
## tr <- read.tree(text = "(monkey:0.09672,((tarsier:0.18996,lemur:0.14790)0.999:0.09005,(macaque:0.18524,(gibbon:0.10388,(orang-utan:0.09481,(human:0.03391,(gorilla:0.06135,chimpanzee:0.05141):0.01580)0.316:0.05381)1.000:0.03019)0.978:0.05616)0.997:0.05042)0.965:0.09672);")
## plot(tr)
# Export the tree graph using the "Export"->"Save as PDF" or "Save as Image" menu in the lower right window
# Exit R studio by typing the command "q()" and type "y" to answer the question for saving the R session
# Copy & paste the tree image into your document to be handed in
|}
|}
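The R terminology introduced on 9/5 (object, object class, function, function arguments) can be illustrated with a minimal base-R sketch using the built-in "iris" data set. The variable names below (iris_class, first_rows, n_rows, n_cols) are illustrative assumptions and are not part of the course material.

```r
# Function call: data() loads a built-in data set; "iris" is its argument
data(iris)

# Object class: the type of data held by the object "iris"
iris_class <- class(iris)

# A function takes an object as input plus optional arguments (here n = 3)
first_rows <- head(iris, n = 3)

# Dimensions of the table (rows = flowers, columns = variables)
n_rows <- nrow(iris)
n_cols <- ncol(iris)

print(iris_class)
print(first_rows)
```

Typing the name of an object (e.g., iris) at the console prints its contents, while class() and head() are functions applied to that object.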
* 9/9 (M). Intro to trees
</center>
** Go over pre-test questions
==Course Overview==
** In-class exercise 1 (5 pts)
Welcome to Introductory BioMedical Genomics, a seminar course for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and rapid DNA- and RNA-sequencing technologies, the biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field that requires familiarity with concepts in biology, computation, and data science.
** Introduction to tree
* 9/12 (TH). Intro to trees (continued)
** In-class exercise 2. (5 pts)
** Textbook Chapter 5: "Molecular Phylogenetics" (pages 170-175; 201-202)
* 9/16 (M). Species Tree & Lineage Sorting.
** Textbook Chapter 5: "Molecular Phylogenetics" (pages 177-180).
* 9/19 (TH). Consensus Tree & Review.
** Chapter 5. pages 199-200 (Figure 5.31)
** In-class exercise 3. (5 pts, due next session)
** Lecture Slides: [[File:Part-1-tree-thinking-2019.pdf|thumbnail]]
* 9/23 (M). 4:10 - 5:10pm '''Midterm Exam I''' <font color="red">Bring pencils, erasers, and a calculator</font>


===Part 2. Analysis of Trait Evolution===
Genome information is revolutionizing virtually all aspects of life sciences including basic research, medicine, and agriculture. Meanwhile, use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, as well as data analysis.
* 9/26 (TH).  Traits & trait matrix
** Textbook Chapter 5, pages 180-183
** R demo I (by Chris)
<syntaxhighlight lang='bash'>
# iris dataset exercise
# load libraries
library(tidyverse)
library(datasets)
data('iris')


# summary of data
This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises, using published studies.
summary(iris)
glimpse(iris)
iris %>% glimpse()


# previewing data
The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.
head(iris)


# subsetting data
==Learning goals==
slice(iris, 1:3)
By the end of this course successful students will be able to:  
iris %>% slice(1:3)
* Describe next-generation sequencing (NGS) technologies & contrast them with traditional Sanger sequencing
* Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
* Visualize and explore genomics data using RStudio
* Replicate key results using a raw data set produced by a primary research paper


# grouping and subsetting data
==Web Links==
iris %>%
* Install R base: https://cloud.r-project.org
  group_by(Species) %>%
* Install R Studio (Desktop version): http://www.rstudio.com/download
  slice(1:3)
* Textbook: [http://r4all.org/#about Introduction to R for Biologists]
* Download: [http://www.r4all.org/books/datasets R datasets]
* A reference book: [https://r4ds.had.co.nz/ R for Data Science (Wickham & Grolemund)]


iris %>%
==Quizzes and Exams==
  group_by(Species) %>%
Student performance will be evaluated by attendance, assignments, quizzes, a mid-term exam, and a final presentation and report:
  summarise(average = mean(Sepal.Length))
* Attendance & In-class participation: 50 pts
* Assignments: 5 x 10 = 50 pts
* Quizzes: 2 x 25 pts = 50 pts
* Mid-term: 50 pts
* Final presentation & report: 100 pts
Total: 300 pts


# filtering data
==Tips for Success==
filter(iris, Species == 'versicolor')
To maximize your experience, we strongly recommend the following strategies:
iris %>%
* Follow the directions for efficiently finding high-impact papers, reading scientific research papers, and preparing presentations.
  filter(Species == 'versicolor')
* Read the papers, watch required videos and do the exercises regularly, long before you attend class.
* Attend all classes, as required.  Late arrival results in loss of points.
* Keep up with online exercises. Don’t wait until the due date to start tasks.
* Take notes or annotate slides while attending the lectures.
* Listen actively and participate in class and in online discussions. 
* Review and summarize material within 24 hrs after class.
* Observe the deadlines for submitting your work. Late submissions incur penalties.
* Put away cell phones; do not text, email, or play computer games in class.


iris %>%
==Hunter/CUNY Policies==
  filter(Sepal.Length >= 7)
* Policy on Academic Integrity
Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on homework, online exercises or examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty.  The College is committed to enforcing the CUNY Policy on Academic Integrity, and we will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures.  Students will be asked to read this statement before exams.


# OR operation
* ADA Policy
iris %>%
In compliance with the Americans with Disabilities Act of 1990 (ADA) and with Section 504 of the Rehabilitation Act of 1973, Hunter College is committed to ensuring educational parity and accommodations for all students with documented disabilities and/or medical conditions. It is recommended that all students with documented disabilities (Emotional, Medical, Physical, and/or Learning) consult the Office of AccessABILITY, located in Room E1214B, to secure necessary academic accommodations. For further information and assistance, please call: (212) 772-4857 or (212) 650-3230.
  filter(Sepal.Length < 5 | Sepal.Length > 7)


# check distribution using histogram
* Syllabus Policy
ggplot(iris, aes(x = Sepal.Length)) +
Except for changes that substantially affect implementation of the evaluation (grading) statement, this syllabus is a guide for the course and is subject to change with advance notice, announced in class or posted on Blackboard.
  geom_histogram()


# distribution by Species
==Course Schedule==
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
===Jan 26, 2019===
  geom_histogram(alpha = 0.5)
* Introduction
* Biblio Searching (NGS) & Group Assignments.
* R Tutorial: Chapter 1. user interface
* Assignment 1: Prepare Group Presentation; Upload by next Friday at 10pm


# distribution by Species using facet_wrap
===Feb 2, 2019===
ggplot(iris, aes(x = Sepal.Length, color = Species)) +
* Group Presentations (20 min/20 slides) on Next-Generation Sequencing Technologies
  geom_histogram() + facet_wrap(~Species)
** Group 1 (Simran, Tiffany, Barbara). NGS vs Sanger
** Group 2 (Kevin, Roland, Hamza): PacBio & Margos et al
** Group 3 (Jerry, Kenneth, Muhammad): Illumina & Di et al
* R Tutorial, Chapter 2. How to get data into R
* Assignment 2: Using ggplot2 to visualize the "height" data set
** boxplot: height by gender
** density plot: density by gender
** histogram by gender
** Upload by next Friday at 10pm


# boxplot
===Feb 9, 2019===
ggplot(iris, aes(y = Sepal.Length, x = Species)) +
* Group presentations
  geom_boxplot()
** Group 4 (Jamie, Tony, Pavan): IonTorrent & NanoPore
** Group 5 (Bing & Nevila) NGS applications in Lyme genomics
* R Tutorial. Chapter 3. Data manipulation with dplyr
===Feb 16, 2019===
* (Self study; No live class)
* Group assignments for research/application papers
** Bac.genome: Tony & Jamie
** Cancer.genome: Pavan & Kenneth
** Chip-Seq: Simran & Jerry
** Human genome: Roland & Barbara
** Microbiome: Muhammad & Tiffany
** RNA-Seq: Bing & Chris
** Single.Cell transcriptome: Kevin & Hamza
** Each student should search the PubMed and identify one primary research paper for a 5-slide presentation next week
* R tutorial: Chapter 4. Data visualization with ggplot2
* Assignment #3: all R scripts and 5 plots in Chapter 4
===Feb 23, 2019===
(Belfer Building, Room BB401, East Conference Room)
* 5-slide presentation on selected paper, including Objective/Goal, Material & Methods, Main results (you want to replicate), & Available data sets and scripts
* Review for mid-term
===March 2, 2019===
* Mid-term exam, including
** NGS terms & vocabulary
** Advantages of NGS over traditional Sanger sequencing
** R practicum: read data table into R; make tall tables; manipulation of data frame with dplyr; basic plots with ggplot2
** Revise/Refine your last presentation by including the following parts:
*** Title slide: Paper citation, web links, group member names, date, version
*** Background/Objectives: 1 slide
*** Experimental samples & methods: 1-2 slides (including NGS tech used)
*** Analytical methods: software, main statistical methods (e.g., type of graphs, tests, and p-value interpretation)
*** Results: 1-2 graphs
*** Conclusion: 1 slide
*** Supplemental Material: 1 data set you will be re-analyzing
*** Analytical plan: type of graphs & type of statistical tests


# boxplot with points
===March 9, 2019===
ggplot(iris, aes(y = Sepal.Length, x = Species)) +
* R tutorial: Section 5.2. Contingency analysis
  geom_boxplot() +
* Group presentations (Data set identified)
  geom_jitter(size = 2, width = 0.1, alpha = 0.5, color = 'blue')
===March 16, 2019===
* R tutorial: Section 5.3. t-test
* Group presentations (Data visualization)
===March 23, 2019===
* (Self study; No live class)
* Abstract (200 words; individualized; due 3/30)
* Review contingency test & two-sample t-test
* Generate preliminary graphs


# scatterplot
===March 30, 2019===  
ggplot(iris, aes(y = Sepal.Length, x = Petal.Length, color = Species)) + geom_point()
* 20 pts Quiz on contingency test & two-sample t-test
</syntaxhighlight>
* Group presentations (Show preliminary graphs)
{| class="wikitable sortable mw-collapsible"
* Material & Methods (due 4/6)
|- style="background-color:lightsteelblue;"
 
! Assignment #3 (5 pts; Due next session)
===April 6, 2019===  
|- style="background-color:white;"
* 20 pts Quiz
|Watch [http://media.hhmi.org/biointeractive/films/OriginSpecies-Lizards.html Origin of Species: Lizards in an Evolutionary Tree]. Provide short answer (1-3 sentences) to each of the following three questions.
* R tutorial: Section 5.4. Regression analysis
# What are the two hypotheses explaining the origin of different ecomorphs of lizards on Caribbean Islands?
* Results (due 4/13)
# What is the expected phylogeny under each hypothesis?
** Tables to show the dataset you work on (not all, but a sample)
# Which hypothesis is supported by the phylogeny of actual DNA sequences?
** Figures with legend (R methods, x & y-axis, conclusion)
|}
** 1-paragraph summary of your results
* 10/3 (TH). Homoplasy & consistency
 
** Character & Character states
===April 13, 2019===
** R Demo (part 2) (Chris)
* 20 pts Quiz. Regression analysis
{| class="wikitable sortable mw-collapsible"
* Background & Introduction (due 5/4)
|- style="background-color:lightsteelblue;"
! Bonus R Exercise (10 pts; Due 10/10, Thursday)
|- style="background-color:white;"
|
# In R studio, load the tidyverse library and read the human gene data table with <code>hg <- read_tsv(file = "http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2", col_names = TRUE)</code>
# Show commands and outputs for the following operations:
## Show first three genes for each chromosome
## Count the number of genes on each chromosome
## Add a column called "Gene.Length"
## Calculate the mean, max, and min gene length on each chromosome
## Show distribution of gene length by a histogram (with binwidth=1e4)
## Show above with log10 transformation
## Show distribution of gene length on each chromosome (with facet_wrap)
## Show distribution of gene length on each chromosome with a boxplot
|}
* 10/7 (M). Parsimony reconstruction (Chapter 5).
** Textbook Chapter 5, pages 188-191
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Assignment #4 (5 pts; Due next session)
|- style="background-color:white;"
|
# Download or Copy/Paste [http://media.hhmi.org/biointeractive/activities/lizard/Anolis-DNA-sequences.txt the lizard DNA sequences] to your own computer and save the file as "lizard.txt"
# Align the DNA sequences [http://www.phylogeny.fr/one_task.cgi?task_type=muscle using this website] and save the aligned DNA file ("Output->Alignment in Fasta format") as "lizard-aligned.txt". Use "one-click" option in the Phylogeny Analysis tab to make a tree.
# Based on [http://media.hhmi.org/biointeractive/activities/lizard/Lizard-Cards-Color.pdf the lizard card], construct a character-state matrix for all lizard species. For each species, list its character state for each of the following two characters (as columns): (1) Geographic origin, and (2) Habitat.
# Construct a diagram by combining the tree and the character-state matrix, showing character states for each species on each row.
# Determine which hypothesis ("Multiple origin" or "Single origin" of ecomorphs) is more supported by the mtDNA tree. Explain.
|}


* 10/10 (TH). Parsimony reconstruction (Continued)
===May 4, 2019===
** In-Class Exercise 4 [[File:In-class-4.pdf|thumbnail]]
* Final presentation I. Graded on:
** Lecture slides: [[File:Part-2-trait-evolution-2019-small.pdf|thumbnail]]
** Objective (original & your own)
* 10/16 (Wed. Monday Schedule). Genome & gene structure (Chapter 3)  
** Material & methods (original & your own)
** Calculate consistency indices for lizard ecomorphs & geographic origins
** Results (your own)
** [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3622293/ Graur et al (2013). "On the immortality of television sets"]
** Conclusion (your own)
* 10/17 (TH). Review & Practices.
** Conclusion (due 5/11)
** In-class exercise: hemoglobin gene structure  [[File:In-class-5.pdf|thumbnail]]
** In-Class Exercise: Pretest Part 2, [[File:Pretest-2.pdf|thumbnail]]
* 10/21 (M). '''Midterm Exam 2'''


===Part 3. Tree Algorithms===
===May 11, 2019===
* 10/24 (TH). (No Class)
* Self study: Prepare your 10-slide presentation
* 10/28 (M).
* No class (instructor travels)
** BLAST & Alignments (Chapter 3, pages 93-100). In-class exercise: Run BLAST; show alignment & explain E-value
** Genetic distances
* 10/31 (TH).
** Sequence-evolutionary models (Chapter 3, pages 79-88). In-class exercise: Poisson simulation & explain
** Lecture slides: [[File:Part-3-tree-construction-2019.pdf|thumbnail]]
* 11/4 (M).
** Distance methods (Chapter 5, pages 184-187). In class exercise: use APE package to calculate genetic distances
** In class exercise: calculate Jukes-Cantor distance of [http://slideplayer.com/slide/8016962/25/images/8/Example+of+DNA+sequence+alignment.jpg this DNA sequence alignment]. Note: Ignore gapped positions.
* 11/7 (TH).
** Maximum parsimony (Chapter 5, pages 191-194). In-class exercise: parsimony scores
** Likelihood & Bayesian methods; 
** Bonus assignment II (5 pts, Due 11/18, Monday):
{| class="wikitable"
|-
|
* The two graphs show the log likelihoods (i.e., goodness of fit, or Prob(Data|Model)) of four nucleotide-substitution models for describing patterns of Human/Chimp DNA sequence divergence
* Reproduce (with proper axis labels and custom size and shape for the points) one of the graphs using R/ggplot2. Read the data set using <code>lk <- read_csv("http://diverge.hunter.cuny.edu/~weigang/lk.csv")</code>
* Explain why HKY is the best model for the data
| [[File:Lk-plot-label.png|thumbnail|Hint: use geom_label()]] || [[File:Lk-plot-color.png|thumbnail|Hint: use geom_point()]]
|}
   
   
* 11/11 (M). 
===May 18, 2019, 9-1pm===
** Tree Testing (Chapter 5, pages 194-198).
* Final presentation
* 11/14 (TH).
* May 22, 2019 (Wed, 5pm): Final Report due (hard copy; in my office or in mailbox)
** Review exercises (Chapter 5, pages 207-209) .
* 11/18 (M).  '''3rd Mid-term exam'''
 
===Part 4. Mechanisms of molecular evolution===
* 11/21 (TH).
** Mechanism of molecular evolution: Overview (pages 35-38) & Rates of nucleotide substitutions (pages 111-125).
* 11/25 (M). In-class computer exercise:
** Ka/Ks test of natural selection (pg 116-124). In-class exercise
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Final project (20 pts; Due: 12/9, Monday)
|- style="background-color:white;"
|
# Calculate genetic distances
## Download or Copy/Paste [http://media.hhmi.org/biointeractive/activities/lizard/Anolis-DNA-sequences.txt the lizard DNA sequences] to your own computer and save the file as "anoles.txt" in a directory (e.g., "Document")
## Align the DNA sequences [http://www.phylogeny.fr/one_task.cgi?task_type=muscle using this website] and save the aligned DNA file ("Output->Alignment in Fasta format") as "anoles-aligned.txt" (No need to print or submit the above two DNA sequence files; save them in a folder, e.g., "Document")
## Download & load library: library(ape)
## In RStudio, set the working directory to the one containing the alignment ("Session" -> "Set Working Directory" -> "Choose Directory")
## Read alignment: mt <- read.FASTA("anoles-aligned.txt")
## Calculate raw distance: mt.raw <- dist.dna(mt, model = "raw")
## Apply Jukes-Cantor (one-parameter model) correction: mt.jc <- dist.dna(mt, model = "JC")
## Apply Kimura (two-parameter model, for Ts and Tv) correction: mt.k80 <- dist.dna(mt, model = "K80")
## Plot JC distance vs the raw distance: plot(mt.raw, mt.jc, xlab = "uncorrected distance (diff/site)", ylab = "corrected distance (sub/site)", xlim = c(0,0.4), ylim = c(0,0.5), las =1)
## Add a 1:1 line: abline(0,1, col = "red")
## Add K80 distances: points(mt.raw, mt.k80, pch = 3, col = "blue")
## Add a legend: legend(0.05, 0.45, legend = c("JC (1-parameter)", "K80 (2-parameter)"), pch = c(1,3), col = c("black","blue"), bty = "n")
## Export a PDF and print a copy
## Use the graph to explain
### (1) Why it is necessary to correct for raw distances when comparing sequences from distantly related species;  
### (2) What is the key difference between the K80 and JC models
# Comparison of distance and parsimony trees (review previous assignments for detailed R-Studio instructions)
## In R studio, install & load the "ape" and "phangorn" libraries
### Obtain a neighbor-joining tree using K80 model: tree.nj <- NJ(mt.k80)
### Plot a midpoint rooted tree: plot(midpoint(tree.nj))
### Add a scale bar: add.scale.bar()
### Print tree and answer this question: what does the distance represent? What is the unit?
## Obtain a maximum parsimony tree
### Convert object to a different class: aln.phy <- as.phyDat(mt)
### Search for the maximum parsimony tree: tree.mp <- optim.parsimony(tree.nj, aln.phy)
### Get tree distance: tree.mp <- acctran(tree.mp, aln.phy)
### Plot tree: plot(midpoint(tree.mp))
### Add a scale bar: add.scale.bar()
### Print tree and answer the question: what does the distance represent? What is the unit?
## Compare the two trees and explain the differences in these two methods: Which one uses full sequence information and why?
# Bootstrap analysis
## aln.fas <- read.dna("anoles-aligned.txt", format ="fasta")
## Create a function for re-rooted distance tree: tree.fun <- function(x) root(nj(dist.dna(x)), outgroup = c("Leiocephalus_barahonensis"), resolve.root = T)
## Calculate a tree: tr <- tree.fun(aln.fas)
## Perform bootstrap for 100 pseudo-replicates:  boot.trees <- boot.phylo(tr, aln.fas, tree.fun, B=100, rooted =T)
## Plot tree: plot(tr, no.margin = T)
## Add bootstrap values as node labels: nodelabels(boot.trees, bg= "white")
## Explain (1) Does bootstrap test for tree precision or tree accuracy? (2) What does a bootstrap value of 80% mean?
|}
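The distance corrections in step 1 of the final project can also be worked out by hand in base R, as a check on what dist.dna() reports. The following is a minimal sketch under stated assumptions: jc_distance() and k80_distance() are hypothetical helper functions (not part of the ape package), p is the proportion of sites that differ between two sequences, and P and Q are the proportions of sites showing transitions and transversions, respectively. The input values are made up for illustration. The sketch uses the standard formulas d = -(3/4) ln(1 - 4p/3) for the Jukes-Cantor model and d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)] for the Kimura two-parameter model.

```r
# Jukes-Cantor (one-parameter) correction.
# p: raw distance (proportion of differing sites), must be < 0.75
jc_distance <- function(p) {
  -(3 / 4) * log(1 - (4 / 3) * p)
}

# Kimura two-parameter (K80) correction.
# P: proportion of sites with transitions; Q: proportion with transversions
k80_distance <- function(P, Q) {
  -(1 / 2) * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))
}

# Illustrative (made-up) raw distances, from closely to distantly related pairs
p_raw <- c(0.01, 0.05, 0.10, 0.20, 0.30)
d_jc <- jc_distance(p_raw)

# The corrected distance exceeds the raw distance, and the gap grows with p
print(data.frame(raw = p_raw, jc = d_jc, excess = d_jc - p_raw))

# Illustrative K80 example: 20% of sites show transitions, 10% transversions
d_k80 <- k80_distance(P = 0.20, Q = 0.10)
print(d_k80)
```

When every type of substitution is equally likely (P = p/3 and Q = 2p/3), the two functions return the same value, so the K80 correction departs from JC only when transitions and transversions accumulate at different rates.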
* 12/2 (M). SNP statistics & gene frequency analysis: In-class exercises.
* 12/5 (TH) Genetic Drift (pages 47-49). Lecture slides: [[File:Part-4-evol-mechanism-2018.pdf|thumbnail]]
* 12/9 (M). (Last Lecture) Review & Course evaluations. Final review slides: [[File:Final-review-2018.pdf|thumbnail]]
** '''Submit your Teacher's Evaluation''', using either:
** Personal computer at [http://www.hunter.cuny.edu/te www.hunter.cuny.edu/te]; or,
** Smartphone at [http://www.hunter.cuny.edu/mobilete www.hunter.cuny.edu/mobilete]
* Dec 16 (Monday) 4-6pm: '''Comprehensive Final  Exam'''

Latest revision as of 21:13, 31 October 2019

BIOL47120 Biomedical Genomics II
Spring 2019
Instructor: Weigang Qiu, Ph.D.
Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center
Adjunct Faculty, Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medical College
Office: B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA
Email: weigang@genectr.hunter.cuny.edu
Lab Website: http://diverge.hunter.cuny.edu/labwiki/
Figures: MA plot (fold change on the y-axis vs. total expression levels on the x-axis); Volcano plot (p-value on the y-axis vs. fold change on the x-axis); Heat map (genes significantly down- or up-regulated at p<1e-4)

Course Overview

Welcome to Introductory BioMedical Genomics, a seminar course for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and rapid DNA- and RNA-sequencing technologies, the biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field that requires familiarity with concepts in biology, computation, and data science.

Genome information is revolutionizing virtually all aspects of life sciences including basic research, medicine, and agriculture. Meanwhile, use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, as well as data analysis.

This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises, using published studies.

The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.

Learning goals

By the end of this course successful students will be able to:

  • Describe next-generation sequencing (NGS) technologies & contrast them with traditional Sanger sequencing
  • Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
  • Visualize and explore genomics data using RStudio
  • Replicate key results using a raw data set produced by a primary research paper

Web Links

Quizzes and Exams

Student performance will be evaluated by attendance, assignments, quizzes, a mid-term exam, and a final presentation and report:

  • Attendance & In-class participation: 50 pts
  • Assignments: 5 x 10 = 50 pts
  • Quizzes: 2 x 25 pts = 50 pts
  • Mid-term: 50 pts
  • Final presentation & report: 100 pts

Total: 300 pts

Tips for Success

To maximize your experience, we strongly recommend the following strategies:

  • Follow the directions for efficiently finding high-impact papers, reading scientific research papers, and preparing presentations.
  • Read the papers, watch required videos and do the exercises regularly, long before you attend class.
  • Attend all classes, as required. Late arrival results in loss of points.
  • Keep up with online exercises. Don’t wait until the due date to start tasks.
  • Take notes or annotate slides while attending the lectures.
  • Listen actively and participate in class and in online discussions.
  • Review and summarize material within 24 hrs after class.
  • Observe the deadlines for submitting your work. Late submissions incur penalties.
  • Put away cell phones; do not text, email, or play computer games in class.

Hunter/CUNY Policies

  • Policy on Academic Integrity

Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on homework, online exercises or examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity, and we will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures. Students will be asked to read this statement before exams.

  • ADA Policy

In compliance with the Americans with Disabilities Act of 1990 (ADA) and with Section 504 of the Rehabilitation Act of 1973, Hunter College is committed to ensuring educational parity and accommodations for all students with documented disabilities and/or medical conditions. It is recommended that all students with documented disabilities (Emotional, Medical, Physical, and/or Learning) consult the Office of AccessABILITY, located in Room E1214B, to secure necessary academic accommodations. For further information and assistance, please call: (212) 772-4857 or (212) 650-3230.

  • Syllabus Policy

Except for changes that substantially affect implementation of the evaluation (grading) statement, this syllabus is a guide for the course and is subject to change with advance notice, announced in class or posted on Blackboard.

Course Schedule

Jan 26, 2019

  • Introduction
  • Biblio Searching (NGS) & Group Assignments.
  • R Tutorial: Chapter 1. user interface
  • Assignment 1: Prepare Group Presentation; Upload by next Friday at 10pm

Feb 2, 2019

  • Group Presentations (20 min/20 slides) on Next-Generation Sequencing Technologies
    • Group 1 (Simran, Tiffany, Barbara): NGS vs Sanger
    • Group 2 (Kevin, Roland, Hamza): PacBio & Margos et al
    • Group 3 (Jerry, Kenneth, Muhammad): Illumina & Di et al
  • R Tutorial, Chapter 2. How to get data into R
  • Assignment 2: Using ggplot2 to visualize the "height" data set
    • boxplot: height by gender
    • density plot: density by gender
    • histogram by gender
    • Upload by next Friday at 10pm

Feb 9, 2019

  • Group presentations
    • Group 4 (Jamie, Tony, Pavan): IonTorrent & NanoPore
    • Group 5 (Bing & Nevila): NGS applications in Lyme genomics
  • R Tutorial: Chapter 3. Data manipulation with dplyr

Feb 16, 2019

  • (Self study; No live class)
  • Group assignments for research/application papers
    • Bac.genome: Tony & Jamie
    • Cancer.genome: Pavan & Kenneth
    • Chip-Seq: Simran & Jerry
    • Human genome: Roland & Barbara
    • Microbiome: Muhammad & Tiffany
    • RNA-Seq: Bing & Chris
    • Single.Cell transcriptome: Kevin & Hamza
    • Each student should search PubMed and identify one primary research paper for a 5-slide presentation next week
  • R tutorial: Chapter 4. Data visualization with ggplot2
  • Assignment #3: all R scripts and 5 plots in Chapter 4

Feb 23, 2019

(Belfer Building, Room BB401, East Conference Room)

  • 5-slide presentation on selected paper, including Objective/Goal, Material & Methods, Main results (you want to replicate), & Available data sets and scripts
  • Review for mid-term

March 2, 2019

  • Mid-term exam, including
    • NGS terms & vocabulary
    • Advantages of NGS over traditional Sanger sequencing
    • R practicum: read data table into R; make tall tables; manipulation of data frame with dplyr; basic plots with ggplot2
    • Revise/Refine your last presentation by including the following parts:
      • Title slide: Paper citation, web links, group member names, date, version
      • Background/Objectives: 1 slide
      • Experimental samples & methods: 1-2 slides (including NGS tech used)
      • Analytical methods: software, main statistical methods (e.g., type of graphs, tests, and p-value interpretation)
      • Results: 1-2 graphs
      • Conclusion: 1 slide
      • Supplemental Material: 1 data set you will be re-analyzing
      • Analytical plan: type of graphs & type of statistical tests

March 9, 2019

  • R tutorial: Section 5.2. Contingency analysis
  • Group presentations (Data set identified)

March 16, 2019

  • R tutorial: Section 5.3. t-test
  • Group presentations (Data visualization)

March 23, 2019

  • (Self study; No live class)
  • Abstract (200 words; individualized; due 3/30)
  • Review contingency test & two-sample t-test
  • Generate preliminary graphs

March 30, 2019

  • 20 pts Quiz on contingency test & two-sample t-test
  • Group presentations (Show preliminary graphs)
  • Material & Methods (due 4/6)

April 6, 2019

  • 20 pts Quiz
  • R tutorial: Section 5.4. Regression analysis
  • Results (due 4/13)
    • Tables to show the dataset you work on (not all, but a sample)
    • Figures with legend (R methods, x & y-axis, conclusion)
    • 1-paragraph summary of your results

April 13, 2019

  • 20 pts Quiz. Regression analysis
  • Background & Introduction (due 5/4)

May 4, 2019

  • Final presentation I. Graded on:
    • Objective (original & your own)
    • Material & methods (original & your own)
    • Results (your own)
    • Conclusion (your own)
    • Conclusion (due 5/11)

May 11, 2019

  • Self study: Prepare your 10-slide presentation
  • No class (instructor travels)

May 18, 2019, 9am-1pm

  • Final presentation
  • May 22, 2019 (Wed, 5pm) Final Report Due (hard copy; in my office or in mailbox)