Southwest-University and Biol375 2019: Difference between pages

From QiuLab
imported>Weigang
 
imported>Lab
m (hw 4 q2 instruction)
 
<center>'''Biomedical Genomics'''</center>
<center>'''Molecular Evolution''' (BIOL 375.00/790.64/793.03, Fall 2019)</center>
<center>July 8-19, 2019</center>
<center>'''Instructor:''' Dr Weigang Qiu, Professor, Department of Biological Sciences </center>
<center>'''Instructor:''' Weigang Qiu, Ph.D.<br>Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center<br>Adjunct Faculty, Department of Physiology and Biophysics,
<center>'''Room:''' 926 HN (Seminar Room, North Building)</center>
Institute for Computational Biomedicine, Weill Cornell Medical College</center>
<center>'''Hours:''' Mon. & Thur 4:10-5:25 pm</center>
<center>'''Office:''' B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA</center>
<center>'''Office Hours:''' Belfer Research Building ([https://www.google.com/maps/place/413+E+69th+St,+New+York,+NY+10021/@40.7655886,-73.9561743,17z/data=!3m1!4b1!4m2!3m1!1s0x89c258c3d235f76f:0x4f3d0d5d8a78fe6?hl=en Google Map]) BB-402; Fridays 3-5pm or by appointment</center>
<center>'''Email:''' weigang@genectr.hunter.cuny.edu</center>
<center>'''Course Website:''' http://diverge.hunter.cuny.edu/labwiki/Biol375_2019</center>
<center>'''Lab Website:''' http://diverge.hunter.cuny.edu/labwiki/</center>
<center>christopher.panlasigui47@myhunter.cuny.edu</center>
<br>
<center>'''Host''': Shunqin Zhu (祝顺琴), Ph.D.<br>Associate Professor, School of  Life Science, South West University</center>
----
----
[[File:Lp54-gain-loss.png|300px|thumbnail|Figure 1. Gains & losses of host-defense genes among Lyme pathogen genomes ([https://www.ncbi.nlm.nih.gov/pubmed/24704760 Qiu & Martin 2014])]]
[[File:Borreliabase-screenshot-1.png|350px|thumbnail]]
==Course Overview==
==Course Description==
Welcome to Biomedical Genomics, a computer workshop for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and next-generation DNA-sequencing technologies, biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field.
Molecular evolution is the study of the change of DNA and protein sequences through time. Theories and techniques of molecular evolution are widely used in species classification, biodiversity, comparative genomics, and molecular epidemiology. Contents of the course include:
* Population genetics, which is a theoretical framework for understanding mechanisms of sequence evolution through mutation, recombination, gene duplication, genetic drift, and natural selection.
* Molecular systematics, which introduces statistical models of sequence evolution and methods for reconstructing species phylogeny.
* Bioinformatics, which  provides hands-on training on data acquisition and the use of software tools for phylogenetic analyses.


Genome information is revolutionizing virtually all aspects of the life sciences, including basic research, medicine, and agriculture. Meanwhile, the use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, and data analysis.
This 3-credit course is designed for upper-level biology-major undergraduates. Hunter pre-requisites are BIOL203, and MATH150 or STAT113.


This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises, using published studies.
==Textbooks==
* ('''Required''') Graur, 2016, Molecular and Genome Evolution, First Edition, Sinauer Associates, Inc. ISBN: 978-1-60535-469-9. [http://www.sinauer.com/molecular-and-genome-evolution.html Publisher's Website] (Student discount: 15% off plus free UPS standard shipping)
* (''Recommended'') Baum & Smith, 2013. Tree Thinking: an Introduction to Phylogenetic Biology, Roberts & Company Publishers, Inc.


The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.
==Learning Goals==
* Be able to describe evolutionary relationships using phylogenetic trees
* Be able to use web-based as well as stand-alone software to infer phylogenetic trees
* Understand mechanisms of DNA sequence evolution
* Understand algorithms for building phylogenetic trees


==Learning goals==
==Links for phylogenetic tools==
By the end of this course successful students will be able to:  
* [http://www.ncbi.nlm.nih.gov/ NCBI sequence databases]
* Describe next-generation sequencing (NGS) technologies & contrast them with traditional Sanger sequencing
* R Tools
* Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
** R source: download & install from [https://mirrors.nics.utk.edu/cran/ a mirror site]
* Visualize and explore genomics data using RStudio
** R Studio: [https://www.rstudio.com/ download & install]
* Replicate key results using a raw data set produced by a primary research paper
** APE package
* [http://phylogeny.fr/ A Molecular Phylogeny Web Server]
* [http://www.evolgenius.info/evolview/ EvolView: an online tree viewer]


==Web Links==
==Exams & Grading==
* Install R base: https://cloud.r-project.org
* Bonus for full attendance & active participation in classroom discussions.
* Install R Studio (Desktop version): http://www.rstudio.com/download
* Assignments.  All assignments should be handed in as hard copies only. Email submission will not be accepted. Late submissions will receive 10% deduction (of the total grade) per day.
* Download: [http://www.r4all.org/books/datasets R datasets]
* Three Mid-term Exams (30 pts each)
* A reference book: [https://r4ds.had.co.nz/ R for Data Science (Wickham & Grolemund)]
* Comprehensive Final Exam (50 pts)


==Quizzes and Exams==
==Academic Honesty==
Student performance will be evaluated by attendance, five assignments, two open-book quizzes, a take-home mid-term, and a final presentation:
While students may work in groups and help each other for assignments, duplicated answers in assignments will be flagged and investigated as possible acts of academic dishonesty. To avoid being investigated as such, <font color="red">do NOT copy anyone else's work, or let others copy your work</font>. At the least, rephrase using your own words. Note that the same rule applies regarding the use of textbook and online resources: copied sentences are not acceptable and will be considered plagiarism.
* Attendance: 50 pts
 
* Assignments: 5 x 10 = 50 pts
Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity and will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures.
* Open-book Quizzes: 2 x 25 pts = 50 pts
* Take-home Mid-term: 50 pts
* Final presentation: 50 pts
Total: 250 pts


==Course Schedule==
==Course Schedule==
{| class="wikitable"
===Part 1. Tree Thinking===
|-
* 8/29 (TH). Overview & Introduction. Textbook Chapter: "Introduction" (pages 1-3)
! Date & Hour !! Tutorials !! Assignment !! Quiz & Exam
{| class="wikitable sortable mw-collapsible"
|-
! Assignment 1 (10 pts; Due next class 9/5)
| July 8 (Mon), 8:40-12:10 || Introduction; R Tutorial I; 
[[File:R-part-1-small.pdf|thumbnail|Lecture slides]]
||
Assignment #1 (create a Word document including scripts & graphs, i.e., compile your work into a lab report; due tomorrow)
* Install R/R studio and the "tidyverse" package on your own computer
* Recreate Script 1 & Mini-Practical
* Show help page for function "seq"
* Download dataset
** Create a new folder (e.g., Desktop/rtutor)
** Create a sub-folder (e.g., Desktop/rtutor/data/)  
** Download from http://www.r4all.org/the-book/datasets
** Save to the sub-folder
** Unzip the file
  ||
|-
| July 9 (Tu), 8:40-12:10 || R Tutorials II & III
[[File:R-part-2.pdf|thumbnail|Lecture slides]]
||
Assignment #2
* The following is a portion of the dataset of Mycobacterium growth (kindly shared by Aswad from Dr Xie's lab). It shows OD (optical density) values. Transform this table ("wide" format) into the "tall/tidy" format (use paper & pen, no need to use R studio or any computer program):
{| class="wikitable"
|-
|-
! Hour !! Control !! Gene !! Control.with.Arg  !! Gene.with.Arg
|  
|-
* (10 pts) Pre-test: Full credits will be given as long as each question is answered with some reasoning. In other words, it will NOT be graded on being right or wrong. It's an assessment tool, to be compared with later test outcomes to show teaching/learning results. [[File:Pretest.pdf|thumbnail]]
| 0 || 0.06 || 0.022 || 0.031 || 0.01
|-
| 4 || 0.087 || 0.102 || 0.082 || 0.081
|-
| 8 || 0.113 || 0.185 || 0.086 || 0.135
|}
|}
* In R studio, read the dataset from the file "FlowerColourVisits.csv" and save it into an object named as "flower"
* 9/5 (TH). Introduction (Continued)
** Show head, tail, dimension of the data frame "flower"
** R terminologies
** Show data summary with "summary" & "glimpse" commands. Which column is a categorical data type?
*** Object: variable that contains data (e.g., "iris")
** Select the column named "colour"
*** Object class: type of data (e.g., "data.frame", which is a table)
** Select rows from the 3rd to the 20th
*** Function: e.g., data(iris), which loads the data set called "iris"
** Select the 3rd, 10th, and 20th rows
*** Function arguments: input and options (e.g., "iris" above)
** Select only the rows that have the colour of "red" (hint: use <code>colour=="red"</code>)
** Tutorial: R & R-Studio <font color="red">(Bring your own computer)</font>
** Create a new column, named "logVisit", that is log(1+number.of.visit)
** Lecture slides: [[File:Intro-2019.pdf|thumbnail]]
** Sort the "flower" data by the column "number.of.visit"
** Perform the following data transformation using the chaining operator (i.e., "%>%"): Select rows from the 3rd to the 20th, then filter by colour of "red", and then show head
{| class="wikitable sortable mw-collapsible"
** Obtain the mean number of visits for each colour as a group (Hint: use "group_by" & "summarise")
! Assignment 2 (5 pts; Due: next session)
||
|-
|-
| July 10 (Wed), 8:40-12:10 || R Tutorial IV
| R exercises
[[File:R-part-3.pdf|thumbnail|Lecture slides]]
# Install R & R-studio (see "Links for phylogenetic tools" above)
||
# Open R-studio and install the "ape" package using the "Packages"->"Install" menu, located within the lower right window
Assignment #3
# Type in the console window (lower left) the following commands (one at a time, wait for the prompt ">" to appear before proceeding to the next command; quit & restart R-studio if stuck):
{| class="wikitable"
## library(ape)
|-
## tr <- read.tree(text = "(monkey:0.09672,((tarsier:0.18996,lemur:0.14790)0.999:0.09005,(macaque:0.18524,(gibbon:0.10388,(orang-utan:0.09481,(human:0.03391,(gorilla:0.06135,chimpanzee:0.05141):0.01580)0.316:0.05381)1.000:0.03019)0.978:0.05616)0.997:0.05042)0.965:0.09672);")
! Task!! Graph
## plot(tr)
|-
# Export the tree graph using the "Export"->"Save as PDF" or "Save as Image" menu in the lower right window
| Use the "iris" dataset to reproduce the plot shown at right (Hint: load data with <code>data(iris)</code>) ||
# Exit R studio by typing the command "q()" and type "y" to answer the question for saving the R session
[[File:Iris-1.png|200px|thumbnail]]
# Copy & paste the tree image into your document to be handed in
|-
| Use the "flower" dataset (see Assignment #2 on how to load data) to reproduce the plot shown at right ||
[[File:Flower-1.png|200px|thumbnail]]
|}
|}
* 9/9 (M). Intro to trees
** Go over pre-test questions
** In-class exercise 1 (5 pts)
** Introduction to tree
* 9/12 (TH).  Intro to trees (continued)
** In-class exercise 2. (5 pts)
** Textbook Chapter 5: "Molecular Phylogenetics" (pages 170-175; 201-202)
* 9/16 (M). Species Tree & Lineage Sorting.
** Textbook Chapter 5: "Molecular Phylogenetics" (pages 177-180).
* 9/19 (TH). Consensus Tree & Review.
** Chapter 5. pages 199-200 (Figure 5.31)
** In-class exercise 3. (5 pts, due next session)
** Lecture Slides: [[File:Part-1-tree-thinking-2019.pdf|thumbnail]]
* 9/23 (M). 4:10 - 5:10pm '''Midterm Exam I''' <font color="red">Bring pencils, erasers, and a calculator</font>
===Part 2. Analysis of Trait Evolution===
* 9/26 (TH).  Traits & trait matrix
** Textbook Chapter 5, pages 180-183
** R demo I (by Chris)
<syntaxhighlight lang='r'>
# iris dataset exercise
# load libraries
library(tidyverse)
library(datasets)
data('iris')
# summary of data
summary(iris)
glimpse(iris)
iris %>% glimpse()
# previewing data
head(iris)
# subsetting data
slice(iris, 1:3)
iris %>% slice(1:3)
# grouping and subsetting data
iris %>%
  group_by(Species) %>%
  slice(1:3)
iris %>%
  group_by(Species) %>%
  summarise(average = mean(Sepal.Length))
# filtering data
filter(iris, Species == 'versicolor')
iris %>%
  filter(Species == 'versicolor')
iris %>%
  filter(Sepal.Length >= 7)
# OR operation
iris %>%
  filter(Sepal.Length < 5 | Sepal.Length > 7)
# check distribution using histogram
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram()
# distribution by Species
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(alpha = 0.5)
# distribution by Species using facet_wrap
ggplot(iris, aes(x = Sepal.Length, color = Species)) +
  geom_histogram() + facet_wrap(~Species)


|| Quiz I
# boxplot
|-
ggplot(iris, aes(y = Sepal.Length, x = Species)) +
| July 11 (Thur), 8:40-12:10 || Intro to NGS; R Tutorial V ||
  geom_boxplot()
 
# boxplot with points
ggplot(iris, aes(y = Sepal.Length, x = Species)) +
  geom_boxplot() +
  geom_jitter(size = 2, width = 0.1, alpha = 0.5, color = 'blue')


||
# scatterplot
Take-home mid-term (50 pts)
ggplot(iris, aes(y = Sepal.Length, x = Petal.Length, color = Species)) + geom_point()
* List pros & cons of Sanger vs NGS
</syntaxhighlight>
* Compare accuracy, read length, and error rate between Illumina and PacBio
{| class="wikitable sortable mw-collapsible"
* Describe sequence information captured with each of the following file formats: FASTA, FASTQ, SAM, VCF
|- style="background-color:lightsteelblue;"
* Run t-test & regression analysis
! Assignment #3 (5 pts; Due next session)
|-
|- style="background-color:white;"
| July 12 - 14 (Fri, Sat & Sun) || (Weekend break; No class) || ||
|Watch [http://media.hhmi.org/biointeractive/films/OriginSpecies-Lizards.html Origin of Species: Lizards in an Evolutionary Tree]. Provide short answer (1-3 sentences) to each of the following three questions.
|-
# What are the two hypotheses explaining the origin of different ecomorphs of lizards on Caribbean Islands?
| July 15 (Mon), 8:00-12:10 || Case Study 1. Fish microbiome || Assignment #4
# What is the expected phylogeny under each hypothesis?
||  
# Which hypothesis is supported by the phylogeny of actual DNA sequences?
|-
|}
| July 16 (Tu), 8:00-12:10 || Case Study 2. Transcriptome || Assignment #5
* 10/3 (TH). Homoplasy & consistency
||  
** R Demo (part 2) (by Chris)
|-
{| class="wikitable sortable mw-collapsible"
| July 17 (Wed), 8:00-12:10 || Case Study 3. Lyme Disease  || || Quiz II
|- style="background-color:lightsteelblue;"
|-
! Assignment #4 (5 pts; Due next session)
| July 18 (Thur), 8:00-12:10|| || || Presentations
|- style="background-color:white;"
|  
# Download or Copy/Paste [http://media.hhmi.org/biointeractive/activities/lizard/Anolis-DNA-sequences.txt the lizard DNA sequences] to your own computer and save the file as "lizard.txt"
# Align the DNA sequences [http://www.phylogeny.fr/one_task.cgi?task_type=muscle using this website] and save the aligned DNA file ("Output->Alignment in Fasta format") as "lizard-aligned.txt". Use "one-click" option in the Phylogeny Analysis tab to make a tree.
# Based on [http://media.hhmi.org/biointeractive/activities/lizard/Lizard-Cards-Color.pdf the lizard card], construct a character-state matrix for all lizard species. For each species, list its character state for each of the following two characters (as columns): (1) Geographic origin, and (2) Habitat.
# Determine which hypothesis ("Multiple origin" or "Single origin" of ecomorphs) is more supported by the mtDNA tree. Explain.
|}
|}
* 10/7 (M). Parsimony reconstruction (Chapter 5).
** Textbook Chapter 5, pages 188-191
* 10/10 (TH). Parsimony reconstruction (Continued)
** In-Class Exercise 4 (Due next session)
* 10/16 (Wed. Monday Schedule). Genome & gene structure (Chapter 3)
** [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3622293/ Graur et al (2013). "On the immortality of television sets"]
* 10/17 (TH). Review & Practices. Lecture Slides
** In-Class Exercise 5. Pretest Part 2 (molecular phylogenetics in forensics)
* 10/21 (M). '''Midterm Exam 2'''


==Papers & Datasets==
===Part 3. Tree Algorithms===
{| class="wikitable sortable"
* 10/24 (TH). BLAST & Alignments (Chapter 3. pages 93-100). In-class exercise: Run BLAST; show alignment & explain E-value
|-
* 10/28 (M). Genetic distances & Sequence-evolutionary models (Chapter 3, pages 79-88). In-class exercise: Poisson simulation & explain
! Omics Application !! Paper link !! Data set !! NGS Technology
* 10/31 (TH). Maximum parsimony (Chapter 5, pages 191-194). In-class exercise: parsimony scores
|-
* 11/4 (M). Distance methods (Chapter 5, pages 184-187). In class exercise: use APE package to calculate genetic distances
| Microbiome || [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0193652 Rimoldi_etal_2018_PlosOne] || [https://doi.org/10.1371/journal.pone.0193652.s004 S1 Dataset] || 16S rDNA amplicon sequencing
* 11/7 (TH). Likelihood & Bayesian methods; Tree Testing (Chapter 5, pages 194-198).  
|-
* 11/11 (M). Review (Chapter 5, pages 207-209). Review exercises. Lecture slides: [[File:Part-3-tree-construction-small-2018.pdf|thumbnail]]
| Transcriptome || [https://science.sciencemag.org/content/350/6264/1096 Wang_etal_2015_Science] || Tables S2 & S4 || RNA-Seq
* 11/14 (TH). '''Midterm Exam 3'''
|-
 
| Transcriptome & Regulome || [https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-019-0477-8 Nava_etal_2019_BMCGenomics] || Tables S2 & S3 || RNA-Seq & ChIP-Seq
===Part 4. Mechanisms of molecular evolution===
|-
* 11/18 (M). Mechanism of molecular evolution: Overview (pages 35-38) & Rates of nucleotide substitutions (pages 111-125).
| Proteome || [https://www.ncbi.nlm.nih.gov/pubmed/28232952 Qiu_etal_2017_NPJ] || (to be posted) || SILAC
* 11/21 (TH). Ka/Ks test of natural selection (pg 116-124). In-class exercise
|-
* 11/25 (M). In-class computer exercise:
| Population genomics (Lyme) || [https://jcm.asm.org/content/56/11/e00940-18.long Di_etal_2018_JCM] || [https://github.com/weigangq/ocseq Data & R codes] || Amplicon sequencing (antigen locus)
{| class="wikitable sortable mw-collapsible"
|-
|- style="background-color:lightsteelblue;"
| Population genomics/GWAS (Human) || [https://science.sciencemag.org/content/351/6274/737.long Simonti_etal_2016_Science] || [https://science.sciencemag.org/highwire/filestream/673591/field_highwire_adjunct_files/1/aad2149-Simonti-SM.Table.S2.xlsx Table S2] || Whole-genome sequencing (WGS); [http://www.internationalgenome.org/ 1000 Genomes Project (IGSR)]
! Final project (20 pts; Due: 12/6, Thursday)
|-
|- style="background-color:white;"
| TB surveillance || [https://jcm.asm.org/content/53/7/2230 Brown_etal_2015]  || [https://www.ebi.ac.uk/ena/data/view/PRJEB9206 Sequence Archives]|| Whole-genome sequencing (WGS)
|  
|-
# Calculate genetic distances
## Download or Copy/Paste [http://media.hhmi.org/biointeractive/activities/lizard/Anolis-DNA-sequences.txt the lizard DNA sequences] to your own computer and save the file as "anoles.txt"
|-
## Align the DNA sequences [http://www.phylogeny.fr/one_task.cgi?task_type=muscle using this website] and save the aligned DNA file ("Output->Alignment in Fasta format") as "anoles-aligned.txt" (No need to print or submit the above two DNA sequence files; save them in a folder)
## Load library: library(ape)
|-
## Read alignment: mt = read.FASTA("anoles-aligned.txt")
## Calculate raw distance: mt.raw = dist.dna(mt, model = "raw")
## Apply Jukes-Cantor (one-parameter model) correction: mt.jc = dist.dna(mt, model = "JC")
## Apply Kimura (two-parameter model, for Ts and Tv) correction: mt.k80 = dist.dna(mt, model = "K80")
## Plot JC distance vs the raw distance: plot(mt.raw, mt.jc, xlab = "uncorrected distance (diff/site)", ylab = "corrected distance (sub/site)", xlim = c(0,0.4), ylim = c(0,0.5), las =1)
## Add a 1:1 line: abline(0,1, col = "red")
## Add K80 distances: points(mt.raw, mt.k80, pch = 3, col = "blue")
## Add a legend: legend(0.05, 0.45, legend = c("JC (1-parameter)", "K80 (2-parameter)"), pch = c(1,3), col = c("black","blue"), bty = "n")
## Export a PDF and print a copy
## Use the graph to explain (1) Why it is necessary to correct for raw distances when comparing sequences from distantly related species; (2) What is the key difference between the K80 and JC models
# Comparison of distance and parsimony trees (review previous assignments for detailed R-Studio instructions)
## In R studio, install & load the "ape" and "phangorn" libraries
### Obtain a neighbor-joining tree using K80 model: tree.nj = NJ(mt.k80)
### Plot a midpoint rooted tree: plot(midpoint(tree.nj))
### Add a scale bar: add.scale.bar()
### Print tree and answer this question: what does the distance represent? What is the unit?
## Obtain a maximum parsimony tree
### Convert object to a different class: aln.phy = as.phyDat(mt)
### Search for the maximum parsimony tree: tree.mp = optim.parsimony(tree.nj, aln.phy)
### Assign branch lengths to the parsimony tree: tree.mp = acctran(tree.mp, aln.phy)
### Plot tree: plot(midpoint(tree.mp))
### Add a scale bar: add.scale.bar()
### Print tree and answer the question: what does the distance represent? What is the unit?
## Compare the two trees and explain the differences in these two methods: Which one uses full sequence information and why?
# Bootstrap analysis
## aln.fas <- read.dna("anoles-aligned.txt", format ="fasta")
## Create a function for re-rooted distance tree: tree.fun = function(x) root(nj(dist.dna(x)), outgroup = c("Leiocephalus_barahonensis"), resolve.root = T)
## Calculate a tree: tr = tree.fun(aln.fas)
## Perform bootstrap for 100 pseudo-replicates:  boot.trees = boot.phylo(tr, aln.fas, tree.fun, B=100, rooted =T)
## Plot tree: plot(tr, no.margin = T)
## Add bootstrap values as node labels: nodelabels(boot.trees, bg= "white")
## Explain (1) Does bootstrap test for tree precision or tree accuracy? (2) What does a bootstrap value of 80% mean?
|}
|}
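The final project obtains raw, Jukes-Cantor, and Kimura two-parameter distances from <code>dist.dna()</code>. As a rough illustration of what those corrections do, the sketch below computes the same three quantities by hand in base R. The two short sequences are made up for illustration (they are not from the lizard data set), and the variable names are assumptions of this sketch.

```r
# Two toy aligned sequences (hypothetical, for illustration only)
seq1 <- strsplit("ACGTACGTACGTACGTACGT", "")[[1]]
seq2 <- strsplit("ACGTACATACGTCCGTACGC", "")[[1]]

purines <- c("A", "G")
diff_sites <- seq1 != seq2
# A transition stays within a base class (purine<->purine or pyrimidine<->pyrimidine)
same_class <- (seq1 %in% purines) == (seq2 %in% purines)

P <- mean(diff_sites & same_class)    # proportion of sites with a transition (Ts)
Q <- mean(diff_sites & !same_class)   # proportion of sites with a transversion (Tv)
p <- P + Q                            # raw (uncorrected) distance, differences per site

# Jukes-Cantor (one-parameter) correction
d_jc <- -3/4 * log(1 - 4/3 * p)
# Kimura two-parameter correction (separate Ts and Tv proportions)
d_k80 <- -1/2 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))

print(c(raw = p, JC = d_jc, K80 = d_k80))
```

Both corrected values come out larger than the raw distance because the corrections add back substitutions hidden by multiple hits at the same site.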
* 12/2 (M). SNP statistics & gene frequency analysis: In-class exercises.
* 12/5 (TH) Genetic Drift (pages 47-49). Lecture slides: [[File:Part-4-evol-mechanism-2018.pdf|thumbnail]]
* 12/9 (M). (Last Lecture) Review & Course evaluations. Final review slides: [[File:Final-review-2018.pdf|thumbnail]]
** '''Submit your Teacher's Evaluation''', using either:
** Personal computer at [http://www.hunter.cuny.edu/te www.hunter.cuny.edu/te]; or,
** Smartphone at [http://www.hunter.cuny.edu/mobilete www.hunter.cuny.edu/mobilete]
* Date & Time to be determined: '''Comprehensive Final  Exam'''

Revision as of 15:56, 4 October 2019

Molecular Evolution (BIOL 375.00/790.64/793.03, Fall 2019)
Instructor: Dr Weigang Qiu, Professor, Department of Biological Sciences
Room: 926 HN (Seminar Room, North Building)
Hours: Mon. & Thur 4:10-5:25 pm
Office Hours: Belfer Research Building (Google Map) BB-402; Fridays 3-5pm or by appointment
Course Website: http://diverge.hunter.cuny.edu/labwiki/Biol375_2019
christopher.panlasigui47@myhunter.cuny.edu


Course Description

Molecular evolution is the study of the change of DNA and protein sequences through time. Theories and techniques of molecular evolution are widely used in species classification, biodiversity, comparative genomics, and molecular epidemiology. Contents of the course include:

  • Population genetics, which is a theoretical framework for understanding mechanisms of sequence evolution through mutation, recombination, gene duplication, genetic drift, and natural selection.
  • Molecular systematics, which introduces statistical models of sequence evolution and methods for reconstructing species phylogeny.
  • Bioinformatics, which provides hands-on training on data acquisition and the use of software tools for phylogenetic analyses.

This 3-credit course is designed for upper-level biology-major undergraduates. Hunter pre-requisites are BIOL203, and MATH150 or STAT113.

Textbooks

  • (Required) Graur, 2016, Molecular and Genome Evolution, First Edition, Sinauer Associates, Inc. ISBN: 978-1-60535-469-9. Publisher's Website: http://www.sinauer.com/molecular-and-genome-evolution.html (Student discount: 15% off plus free UPS standard shipping)

  • (Recommended) Baum & Smith, 2013. Tree Thinking: an Introduction to Phylogenetic Biology, Roberts & Company Publishers, Inc.

Learning Goals

  • Be able to describe evolutionary relationships using phylogenetic trees
  • Be able to use web-based as well as stand-alone software to infer phylogenetic trees
  • Understand mechanisms of DNA sequence evolution
  • Understand algorithms for building phylogenetic trees

Links for phylogenetic tools

Exams & Grading

  • Bonus for full attendance & active participation in classroom discussions.
  • Assignments. All assignments should be handed in as hard copies only. Email submission will not be accepted. Late submissions will receive 10% deduction (of the total grade) per day.
  • Three Mid-term Exams (30 pts each)
  • Comprehensive Final Exam (50 pts)

Academic Honesty

While students may work in groups and help each other for assignments, duplicated answers in assignments will be flagged and investigated as possible acts of academic dishonesty. To avoid being investigated as such, do NOT copy anyone else's work, or let others copy your work. At the least, rephrase using your own words. Note that the same rule applies regarding the use of textbook and online resources: copied sentences are not acceptable and will be considered plagiarism.

Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity and will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures.

Course Schedule

Part 1. Tree Thinking

  • 8/29 (TH). Overview & Introduction. Textbook Chapter: "Introduction" (pages 1-3)
Assignment 1 (10 pts; Due next class 9/5)
  • (10 pts) Pre-test: Full credits will be given as long as each question is answered with some reasoning. In other words, it will NOT be graded on being right or wrong. It's an assessment tool, to be compared with later test outcomes to show teaching/learning results.
  • 9/5 (TH). Introduction (Continued)
    • R terminologies
      • Object: variable that contains data (e.g., "iris")
      • Object class: type of data (e.g., "data.frame", which is a table)
      • Function: e.g., data(iris), which loads the data set called "iris"
      • Function arguments: input and options (e.g., "iris" above)
    • Tutorial: R & R-Studio (Bring your own computer)
    • Lecture slides:
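The R terminologies listed above can be tried directly in the console. A minimal sketch using only base R and the built-in "iris" data set; the names obj_class and first_rows are illustrative and do not come from the lecture slides.

```r
# Object: a variable that contains data
data(iris)                       # a function call; "iris" is its argument

# Object class: the type of data held by the object
obj_class <- class(iris)         # "data.frame", i.e., a table
print(obj_class)

# Function arguments: an input (iris) and an option (n = 3)
first_rows <- head(iris, n = 3)  # first three rows of the table
print(first_rows)

# Dimensions of the table: number of rows, number of columns
print(dim(iris))
```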
Assignment 2 (5 pts; Due: next session)
R exercises
  1. Install R & R-studio (see "Links for phylogenetic tools" above)
  2. Open R-studio and install the "ape" package using the "Packages"->"Install" menu, located within the lower right window
  3. Type in the console window (lower left) the following commands (one at a time, wait for the prompt ">" to appear before proceeding to the next command; quit & restart R-studio if stuck):
    1. library(ape)
    2. tr <- read.tree(text = "(monkey:0.09672,((tarsier:0.18996,lemur:0.14790)0.999:0.09005,(macaque:0.18524,(gibbon:0.10388,(orang-utan:0.09481,(human:0.03391,(gorilla:0.06135,chimpanzee:0.05141):0.01580)0.316:0.05381)1.000:0.03019)0.978:0.05616)0.997:0.05042)0.965:0.09672);")
    3. plot(tr)
  4. Export the tree graph using the "Export"->"Save as PDF" or "Save as Image" menu in the lower right window
  5. Exit R studio by typing the command "q()" and type "y" to answer the question for saving the R session
  6. Copy & paste the tree image into your document to be handed in
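The console steps of Assignment 2 can also be kept as a script, which makes them easier to rerun. The sketch below assumes the "ape" package is already installed; the output file name is hypothetical, and writing the PDF with pdf() and dev.off() is an alternative to the "Export" menu rather than part of the assignment.

```r
library(ape)  # assumes the "ape" package is already installed

# Newick string copied from the assignment
newick <- "(monkey:0.09672,((tarsier:0.18996,lemur:0.14790)0.999:0.09005,(macaque:0.18524,(gibbon:0.10388,(orang-utan:0.09481,(human:0.03391,(gorilla:0.06135,chimpanzee:0.05141):0.01580)0.316:0.05381)1.000:0.03019)0.978:0.05616)0.997:0.05042)0.965:0.09672);"
tr <- read.tree(text = newick)

# Inspect the object before plotting
print(class(tr))     # object class of a tree read by ape
print(Ntip(tr))      # number of tips (species)
print(tr$tip.label)  # species names

# Write the plot to a PDF file instead of using the Export menu
out_file <- file.path(tempdir(), "primate-tree.pdf")  # hypothetical file name
pdf(out_file)
plot(tr)
invisible(dev.off())
```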
  • 9/9 (M). Intro to trees
    • Go over pre-test questions
    • In-class exercise 1 (5 pts)
    • Introduction to tree
  • 9/12 (TH). Intro to trees (continued)
    • In-class exercise 2. (5 pts)
    • Textbook Chapter 5: "Molecular Phylogenetics" (pages 170-175; 201-202)
  • 9/16 (M). Species Tree & Lineage Sorting.
    • Textbook Chapter 5: "Molecular Phylogenetics" (pages 177-180).
  • 9/19 (TH). Consensus Tree & Review.
  • 9/23 (M). 4:10 - 5:10pm Midterm Exam I Bring pencils, erasers, and a calculator

Part 2. Analysis of Trait Evolution

  • 9/26 (TH). Traits & trait matrix
    • Textbook Chapter 5, pages 180-183
    • R demo I (by Chris)
# iris dataset exercise
# load libraries
library(tidyverse)
library(datasets)
data('iris')

# summary of data
summary(iris)
glimpse(iris)
iris %>% glimpse()

# previewing data
head(iris)

# subsetting data
slice(iris, 1:3)
iris %>% slice(1:3)

# grouping and subsetting data
iris %>%
  group_by(Species) %>%
  slice(1:3)

iris %>%
  group_by(Species) %>%
  summarise(average = mean(Sepal.Length))

# filtering data
filter(iris, Species == 'versicolor')
iris %>%
  filter(Species == 'versicolor')

iris %>%
  filter(Sepal.Length >= 7)

# OR operation
iris %>%
  filter(Sepal.Length < 5 | Sepal.Length > 7)

# check distribution using histogram
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram()

# distribution by Species
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(alpha = 0.5)

# distribution by Species using facet_wrap
ggplot(iris, aes(x = Sepal.Length, color = Species)) +
  geom_histogram() + facet_wrap(~Species)

# boxplot
ggplot(iris, aes(y = Sepal.Length, x = Species)) +
  geom_boxplot()

# boxplot with points
ggplot(iris, aes(y = Sepal.Length, x = Species)) +
  geom_boxplot() +
  geom_jitter(size = 2, width = 0.1, alpha = 0.5, color = 'blue')

# scatterplot
ggplot(iris, aes(y = Sepal.Length, x = Petal.Length, color = Species)) + geom_point()
Assignment #3 (5 pts; Due next session)
Watch Origin of Species: Lizards in an Evolutionary Tree. Provide short answer (1-3 sentences) to each of the following three questions.
  1. What are the two hypotheses explaining the origin of different ecomorphs of lizards on Caribbean Islands?
  2. What is the expected phylogeny under each hypothesis?
  3. Which hypothesis is supported by the phylogeny of actual DNA sequences?
  • 10/3 (TH). Homoplasy & consistency
    • R Demo (part 2) (by Chris)
Assignment #4 (5 pts; Due next session)
  1. Download or Copy/Paste the lizard DNA sequences to your own computer and save the file as "lizard.txt"
  2. Align the DNA sequences using this website and save the aligned DNA file ("Output->Alignment in Fasta format") as "lizard-aligned.txt". Use "one-click" option in the Phylogeny Analysis tab to make a tree.
  3. Based on the lizard card, construct a character-state matrix for all lizard species. For each species, list its character state for each of the following two characters (as columns): (1) Geographic origin, and (2) Habitat.
  4. Determine which hypothesis ("Multiple origin" or "Single origin" of ecomorphs) is better supported by the mtDNA tree. Explain.
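
One possible way to lay out the character-state matrix from step 3 is as an R data frame, with one row per species and one column per character. The sketch below uses placeholder species names and states (hypothetical values, not taken from the lizard card); substitute the actual entries before using it.

```r
# Character-state matrix as a data frame.
# All names and states below are placeholders (hypothetical);
# replace them with the values read from the lizard card.
lizards <- data.frame(
  species = c("species_1", "species_2", "species_3", "species_4"),
  origin  = c("island_A", "island_A", "island_B", "island_B"),
  habitat = c("habitat_x", "habitat_y", "habitat_x", "habitat_y"),
  stringsAsFactors = FALSE
)
print(lizards)

# Cross-tabulate the two characters: how many species show
# each combination of geographic origin and habitat
state_counts <- table(lizards$origin, lizards$habitat)
print(state_counts)
```

Mapping these two columns onto the tips of the mtDNA tree shows whether species sharing a habitat, or species sharing an island, cluster together.
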
  • 10/7 (M). Parsimony reconstruction (Chapter 5).
    • Textbook Chapter 5, pages 188-191
  • 10/10 (TH). Parsimony reconstruction (Continued)
    • In-Class Exercise 4 (Due next session)
  • 10/16 (Wed. Monday Schedule). Genome & gene structure (Chapter 3)
  • 10/17 (TH). Review & Practice. Lecture Slides
    • In-Class Exercise 5. Pretest Part 2 (molecular phylogenetics in forensics)
  • 10/21 (M). Midterm Exam 2

Part 3. Tree Algorithms

  • 10/24 (TH). BLAST & Alignments (Chapter 3, pages 93-100). In-class exercise: Run BLAST; show alignment & explain E-value
  • 10/28 (M). Genetic distances & Sequence-evolutionary models (Chapter 3, pages 79-88). In-class exercise: Poisson simulation & explain
  • 10/31 (TH). Maximum parsimony (Chapter 5, pages 191-194). In-class exercise: parsimony scores
  • 11/4 (M). Distance methods (Chapter 5, pages 184-187). In-class exercise: use the ape package to calculate genetic distances
  • 11/7 (TH). Likelihood & Bayesian methods; Tree Testing (Chapter 5, pages 194-198).
  • 11/11 (M). Review (Chapter 5, pages 207-209). Review exercises. Lecture slides
  • 11/14 (TH). Midterm Exam 3
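
The Poisson simulation and the Jukes-Cantor correction listed in the sessions above can be combined into a short base-R sketch. It is an illustration only, not the in-class exercise itself: the true distance `d_true`, the sequence length `n_sites`, and the helper `evolve_site` are assumed names and values chosen for the example. Each site receives a Poisson-distributed number of substitutions, and each substitution moves the site to one of the other three nucleotides with equal probability, which is the Jukes-Cantor assumption.

```r
set.seed(375)

n_sites <- 10000   # assumed sequence length
d_true  <- 0.5     # assumed true distance (substitutions per site)

# Number of substitutions hitting each site follows a Poisson distribution
hits <- rpois(n_sites, lambda = d_true)

# Evolve one site: start in state 1 (of 4 nucleotides); every hit
# switches to one of the three other states with equal probability
evolve_site <- function(k) {
  state <- 1L
  for (i in seq_len(k)) {
    state <- sample(setdiff(1:4, state), 1)
  }
  state
}
final <- vapply(hits, evolve_site, integer(1))

# Observed (raw) proportion of sites that differ from the ancestor
p_obs <- mean(final != 1L)

# Jukes-Cantor correction: d = -3/4 * ln(1 - 4p/3)
d_jc <- -3/4 * log(1 - 4/3 * p_obs)

cat("true distance:     ", d_true, "\n")
cat("raw distance (p):  ", p_obs, "\n")
cat("JC-corrected (d):  ", d_jc, "\n")
```

The raw distance comes out smaller than the true distance because sites hit more than once are counted at most once (or not at all, if they revert); the correction recovers an estimate close to the simulated value.
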

Part 4. Mechanisms of molecular evolution

  • 11/18 (M). Mechanism of molecular evolution: Overview (pages 35-38) & Rates of nucleotide substitutions (pages 111-125).
  • 11/21 (TH). Ka/Ks test of natural selection (pages 116-124). In-class exercise
  • 11/25 (M). In-class computer exercise:
Final project (20 pts; Due: 12/6, Thursday)
  1. Calculate genetic distances
    1. Download or Copy/Paste the lizard DNA sequences to your own computer and save the file as "anoles.txt"
    2. Align the DNA sequences using this website and save the aligned DNA file ("Output->Alignment in Fasta format") as "anoles-aligned.txt" (No need to print or submit the above two DNA sequence files; save them in a folder)
    3. Load library: library(ape)
    4. Read alignment: mt = read.FASTA("anoles-aligned.txt")
    5. Calculate raw distance: mt.raw = dist.dna(mt, model = "raw")
    6. Apply Jukes-Cantor (one-parameter model) correction: mt.jc = dist.dna(mt, model = "JC")
    7. Apply Kimura (two-parameter model, for transitions and transversions) correction: mt.k80 = dist.dna(mt, model = "K80")
    8. Plot JC distance vs the raw distance: plot(mt.raw, mt.jc, xlab = "uncorrected distance (diff/site)", ylab = "corrected distance (sub/site)", xlim = c(0,0.4), ylim = c(0,0.5), las =1)
    9. Add a 1:1 line: abline(0,1, col = "red")
    10. Add K80 distances: points(mt.raw, mt.k80, pch = 3, col = "blue")
    11. Add a legend: legend(0.05, 0.45, legend = c("JC (1-parameter)", "K80 (2-parameter)"), pch = c(1,3), col = c("black","blue"), bty = "n")
    12. Export a PDF and print a copy
    13. Use the graph to explain (1) Why it is necessary to correct raw distances when comparing sequences from distantly related species; (2) What is the key difference between the K80 and JC models
  2. Comparison of distance and parsimony trees (review previous assignments for detailed RStudio instructions)
    1. In RStudio, install & load the "ape" and "phangorn" libraries, then obtain a neighbor-joining tree
      1. Obtain a neighbor-joining tree using K80 model: tree.nj = NJ(mt.k80)
      2. Plot a midpoint rooted tree: plot(midpoint(tree.nj))
      3. Add a scale bar: add.scale.bar()
      4. Print tree and answer this question: what does the distance represent? What is the unit?
    2. Obtain a maximum parsimony tree
      1. Convert object to a different class: aln.phy = as.phyDat(mt)
      2. Search for the maximum parsimony tree: tree.mp = optim.parsimony(tree.nj, aln.phy)
      3. Assign branch lengths to the tree: tree.mp = acctran(tree.mp, aln.phy)
      4. Plot tree: plot(midpoint(tree.mp))
      5. Add a scale bar: add.scale.bar()
      6. Print tree and answer the question: what does the distance represent? What is the unit?
    3. Compare the two trees and explain the differences in these two methods: Which one uses full sequence information and why?
  3. Bootstrap analysis
    1. Read the alignment: aln.fas <- read.dna("anoles-aligned.txt", format = "fasta")
    2. Create a function for re-rooted distance tree: tree.fun = function(x) root(nj(dist.dna(x)), outgroup = c("Leiocephalus_barahonensis"), resolve.root = T)
    3. Calculate a tree: tr = tree.fun(aln.fas)
    4. Perform bootstrap for 100 pseudo-replicates: boot.trees = boot.phylo(tr, aln.fas, tree.fun, B=100, rooted =T)
    5. Plot tree: plot(tr, no.margin = T)
    6. Add bootstrap values as node labels: nodelabels(boot.trees, bg= "white")
    7. Explain (1) Does bootstrap test for tree precision or tree accuracy? (2) What does a bootstrap value of 80% mean?
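
The resampling idea behind the bootstrap steps above can be sketched in base R without any packages. The example uses a small made-up alignment (the taxa and sequences are hypothetical, not the anole data), and the helper `p_dist` is an illustrative name. Each pseudo-replicate is built by drawing alignment columns with replacement; `boot.phylo` applies the same kind of column resampling, rebuilds a tree from each pseudo-replicate with the supplied function, and counts how often each clade of the original tree reappears.

```r
set.seed(2019)

# Toy alignment (made-up data): 3 taxa x 12 aligned sites
aln <- rbind(
  taxon_A = strsplit("ACGTACGTACGT", "")[[1]],
  taxon_B = strsplit("ACGTACGAACGA", "")[[1]],
  taxon_C = strsplit("TCGAACGAACTA", "")[[1]]
)
n_sites <- ncol(aln)

# Raw distance: proportion of sites that differ between two taxa
p_dist <- function(m, i, j) mean(m[i, ] != m[j, ])

# One pseudo-replicate: sample columns (sites) with replacement,
# keeping the number of sites the same as in the original alignment
cols   <- sample(n_sites, replace = TRUE)
pseudo <- aln[, cols]

# Repeat B times and record the A-B distance in each pseudo-replicate
B <- 100
boot_d <- replicate(B, {
  m <- aln[, sample(n_sites, replace = TRUE)]
  p_dist(m, "taxon_A", "taxon_B")
})

print(p_dist(aln, "taxon_A", "taxon_B"))  # distance in the original alignment
print(range(boot_d))                      # spread across pseudo-replicates
```

The spread of `boot_d` reflects how much an estimate varies when the same data are resampled, which is a useful starting point for the precision-versus-accuracy question.
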