Biol20N02 2016
Course Description
With the rapid accumulation of genome sequences and digitized health data, biomedicine is becoming a data-intensive science. This course is a hands-on, computer-based workshop on how to visualize and analyze large quantities of biological data. The course introduces R, a modern statistical computing language and platform. Students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data visualizations. The course will review statistical methods including hypothesis testing, analysis of frequencies, and correlation analysis. Students will apply these methods to the analysis of genomic and health data such as whole-genome gene expression and SNP (single-nucleotide polymorphism) frequencies.
This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter prerequisites are BIOL100, BIOL102 and STAT113.
Learning Goals
 Be able to use R as a plotting tool to visualize large-scale biological data sets
 Be able to use R as a statistical tool to summarize data and make biological inferences
 Be able to use R as a programming language to automate data analysis
Textbooks
 Whitlock & Schluter (2015). The Analysis of Biological Data (2nd edition).
 R Studio (Recommended): Learning RStudio for R Statistical Computing
 Digital textbook (Recommended): Data Analysis for the Life Sciences
Exams & Grading
 Attendance (or a note in case of absence) is required
 In-Class Exercises (50 pts).
 Assignments (~100 pts total). All assignments should be handed in as hard copies only; email submissions will not be accepted. Late submissions will receive a 10% deduction (of the total grade) per day.
 Three Midterm Exams (3 X 30 pts each = 90 pts)
 Comprehensive Final Exam (50 pts)
 Bonus for active participation in classroom discussions
Course Outline
Feb 2. Introduction & tutorials for R/RStudio
 Course overview
 Install R & RStudio on your home computers (Chapter 1. pg. 9)
 Tutorial 1: First R Session (pg. 12)
 Create a new project by navigating: File > New Project > New Directory. Name the project "Abalone"
 Import abalone data set: Tools > Import Dataset > From Web URL, copy & paste this address: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
 Assign column names:
colnames(abalone) <- c("Sex", "Length", "Diameter", "Height", "Whole_Weight", "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings")
 Save data into a file:
write.csv(abalone, "abalone.csv", row.names = FALSE)
 Create a new R script: File > New > R script. Type the following commands:
abalone <- read.csv("abalone.csv"); boxplot(Length ~ Sex, data = abalone)
 Save as "abalone.R" using File > Save
 Execute R script:
source("abalone.R")
 Install the notebook package:
install.packages("knitr")
 Compile a Notebook: File > Compile Notebook > HTML > Open in Browser
 Tutorial 2. Writing R Scripts (Chapter 2. pg. 21)
 Tutorial 3. Vector
Assignment #1. Due 2/16, Tuesday (Finalized) 


Feb 9. No class (Friday Schedule)
Feb 16. Introduction & tutorials for R/RStudio
 Start a new project called "Session02individual"
 Tutorial 3: Vector (Continued)
x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponential function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = 1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector of identical elements
x <- rep(NA, times = 10) # predefine a vector with unknown elements; use no quotes around NA
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements
x[1]
x[1:3]
x[-2] # a negative index drops an element: everything except the second
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
# Logical vectors
is.healthy <- c(TRUE, TRUE, FALSE, TRUE, FALSE) # use no quotes
is.male <- (gender == "male") # obtain a logical vector by a test
age <- c(60, 43, 72, 35, 47)
is.60 <- (age == 60)
less.60 <- (age < 60) # TRUE for individuals younger than 60
is.female <- !is.male # use the logical negation operator (!)
# The which() function returns the indices of TRUE elements
ind.male <- which(is.male)
ind.young <- which(age < 45)
age[ind.young] # obtain ages of young individuals
 Tutorial 4: Matrix
BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
data.1 <- cbind(age, BMI, bp) # create a matrix using the column-bind function cbind(); individuals in rows
data.1
data.2 <- rbind(age, BMI, bp) # create a matrix using the row-bind function rbind(); individuals in columns
t(data.1) # transpose a matrix: columns to rows & rows to columns
dim(data.1) # dimension of the matrix
colnames(data.1)
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
data.1
data.1[3,1] # access the element in row 3, column 1
data.1[2,] # access all elements in row 2
data.1[,2] # access all elements in column 2
matrix(data = 1:12, nrow = 3, ncol = 4) # create a matrix with three rows and four columns; filled by column
matrix(data = 1:12, nrow = 3, ncol = 4, byrow = TRUE) # filled by row
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create a matrix filled with NA, to be populated later
mat[1,3] <- 5 # assign a value to a matrix element
Assignment #2. Due 2/23, Tuesday (Finalized on 2/18, Thursday 10am) 

Show commands and outputs for the following exercises:

Feb 23. Statistics & samples
 Tutorial 5. Data Frame: a table to store mixed data types
data.df <- data.frame(age, gender, is.healthy, BMI) # BMI is the vector from Tutorial 4; it provides the fourth column used below
data.df
class(data.df) # check object type
factor(gender) # categories (called "levels") of a character vector
data.df[3,4] # access row 3, column 4
data.df[, "age"] # a vector of all ages
data.df$age # an alternative way, using the $ notation
data.df$BMI[4]
data.df$gender[2]
# Create a data frame from text files:
# Download and save the file: http://extras.springer.com/2012/9781461413011/BodyTemperature.txt
BodyTemperature <- read.table(file = "BodyTemperature.txt", header = TRUE, sep = " ")
head(BodyTemperature) # show the first 6 lines (the default)
names(BodyTemperature) # show column headings
BodyTemperature[1:3, 2:4] # show a slice of data: rows 1 to 3, columns 2 to 4
BodyTemperature$Age[1:3] # show the first 3 ages
 Population and sample
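x <- rnorm(1000) # a "population" of 1000 normally distributed numbers
x.sample.1 <- sample(x, 100) # a random sample of 100 drawn from the population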
 Explore variable distributions
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for all numbers
hist(BodyTemperature[,2], main = "Age frequency distribution", xlab = "Age", ylab = "Counts") # age distribution, with customized title and axis labels
stem(BodyTemperature[,2]) # stem-and-leaf plot
table(BodyTemperature[,1]) # distribution for a categorical vector
Assignment #3. Due 3/1, Tu (Finalized) 


March 1. Displaying data
Slides for part 1: File:Biostatpart1.pdf
Assignment #4. Due 3/8, Finalized 


March 8. Exam 1 (Open Book)
March 15. Describing data
 In-Class exercise: A study of human gene length
 Import human gene data set from http://diverge.hunter.cuny.edu/~weigang/datasetsforbiostat/hg.tsv2
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: most genes are short, a few are very long
mean(hg.len) # not representative: super-long genes carry too much weight in the average length
median(hg.len) # more representative. Use the median for a variable that is not normally distributed
summary(hg.len) # show all quartiles
IQR(hg.len) # 3rd quartile - 1st quartile: the range of the middle 50% of data points, meaningful even for a skewed distribution
log.len <- log10(hg.len); hist(log.len, br=200) # log of gene length is more normally distributed
mean(log.len); median(log.len) # they should be similar, since log.len is approximately normal
# The next block is intended to show that the "mean length" of samples is normally distributed, although the length itself is not
samp.len <- sample(hg.len, 100) # take a random sample of 100 lengths
mean(samp.len) # a sample mean
# Repeat the above 1000 times, so we can study the distribution of "mean length" (not "length" itself)
mean.len <- rep(NA, 1000) # prepare an empty vector to store the "mean lengths"
for (i in 1:1000) { # i takes the values from 1 to 1000, one at a time
  samp.len <- sample(hg.len, 100)
  mean.len[i] <- mean(samp.len)
}
hist(mean.len, br=100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
Assignment #5. Due 3/22. (Finalized) 

# Sample size 100
samp.size.100.means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(hg.len, 100)
  samp.size.100.means[i] <- mean(samp)
}
hist(samp.size.100.means, br=100)
# Sample size 20
samp.size.20.means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(hg.len, 20)
  samp.size.20.means[i] <- mean(samp)
}
hist(samp.size.20.means, br=100)
# Sample size 500
samp.size.500.means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(hg.len, 500)
  samp.size.500.means[i] <- mean(samp)
}
hist(samp.size.500.means, br=100)
# Combine the three vectors of sample means into one matrix, one column per sample size
sample.combined <- cbind(samp.size.20.means, samp.size.100.means, samp.size.500.means)
colnames(sample.combined) <- c("samp.20", "samp.100", "samp.500")
# plot in a single frame
par(mfrow=c(3,1))
hist(sample.combined[,1], br=100, xlim=c(1e4, 2e5), main="sample size 20", xlab = "mean gene length")
hist(sample.combined[,2], br=100, xlim=c(1e4, 2e5), main = "sample size 100", xlab = "mean gene length")
hist(sample.combined[,3], br=100, xlim=c(1e4, 2e5), main = "sample size 500", xlab = "mean gene length")
par(mfrow =c(1,1)) 
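The three near-identical loops above can also be written once as a function, in line with the learning goal of using R to automate analysis. The following is a minimal sketch, not the assignment's required solution: the helper name sample.means and the vector sim.len are hypothetical, and sim.len is a simulated right-skewed stand-in for hg.len (an assumption, since the course data set is not loaded here).

```r
# Simulated, right-skewed stand-in for hg.len (NOT the course data)
set.seed(2016)  # only to make the run reproducible
sim.len <- round(rlnorm(20000, meanlog = 10, sdlog = 1.2))

# Hypothetical helper: means of n.rep random samples, each of size samp.size
sample.means <- function(v, samp.size, n.rep = 1000) {
  replicate(n.rep, mean(sample(v, samp.size)))
}

# Apply the helper to each sample size instead of copying the loop
sizes <- c(20, 100, 500)
means.by.size <- lapply(sizes, function(s) sample.means(sim.len, s))
names(means.by.size) <- paste0("samp.", sizes)

# The spread of the sample means shrinks as the sample size grows
sapply(means.by.size, sd)

# Plot the three histograms in a single frame, on a common x-axis
x.lim <- range(unlist(means.by.size))
par(mfrow = c(3, 1))
for (nm in names(means.by.size)) {
  hist(means.by.size[[nm]], breaks = 100, xlim = x.lim,
       main = nm, xlab = "mean length")
}
par(mfrow = c(1, 1))
```

Replacing sim.len with hg.len should reproduce the three histograms above, up to random sampling variation.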
March 22. Sampling & Standard Error of Mean
 In-Class exercise 1. Descriptive statistics
 Make a vector of the following blood pressure measurements (in mmHg): 112, 128, 108, 129, 125, 153, 155, 132, 137. Calculate sample size, sum, mean, variance, coefficient of variation (CV), and median
 Take a sample of 100 human gene lengths. Calculate median, IQR, 1.5*IQR; make a boxplot
 The following are measurements of body mass (in grams) of three species of finches in Africa. Calculate the mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
 Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7
 Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
 Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
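One possible way to approach the blood pressure and finch items of exercise 1 is sketched below. The names bp, finch, and cv are illustrative choices, not part of the exercise. CV is computed here as the standard deviation divided by the mean, expressed as a percentage.

```r
# Blood pressure measurements (mmHg) from the exercise
bp <- c(112, 128, 108, 129, 125, 153, 155, 132, 137)

# Hypothetical helper: coefficient of variation, in percent
cv <- function(v) 100 * sd(v) / mean(v)

length(bp)   # sample size
sum(bp)      # sum
mean(bp)     # mean
var(bp)      # sample variance (n - 1 in the denominator)
cv(bp)       # coefficient of variation
median(bp)   # median

# Finch body mass (grams) in "long" format: one row per bird
finch <- data.frame(
  mass = c(8, 8, 8, 8, 8, 8, 8, 6, 7, 7, 7, 8, 8, 8, 7, 7,
           16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16,
           40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41),
  species = factor(rep(c("Species 1", "Species 2", "Species 3"),
                       times = c(16, 12, 16)))
)

# Mean, standard deviation, and CV for each species
tapply(finch$mass, finch$species, mean)
tapply(finch$mass, finch$species, sd)
tapply(finch$mass, finch$species, cv)

# Box plot and strip chart, separated by species
boxplot(mass ~ species, data = finch, ylab = "Body mass (g)")
stripchart(mass ~ species, data = finch, vertical = TRUE,
           method = "jitter", pch = 1, ylab = "Body mass (g)")
```

Storing the measurements in one data frame with a species factor lets the same formula (mass ~ species) drive both plots.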
 In-Class exercise 2. Standard error & confidence interval
 Blood pressure: What is the standard deviation of the above blood pressure measurements?
 What is the sample size? Calculate the standard error of the mean.
 Using the 2-SE rule of thumb, calculate the 95% confidence interval.
 Plot standard error & standard deviation
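A minimal sketch of exercise 2 follows, assuming the standard error of the mean is the standard deviation divided by the square root of the sample size, and that the rule of thumb places the 95% confidence interval at the mean plus or minus 2 standard errors. The variable names and the error-bar plot layout are illustrative choices only.

```r
# Blood pressure measurements (mmHg) from exercise 1
bp <- c(112, 128, 108, 129, 125, 153, 155, 132, 137)

n     <- length(bp)        # sample size
bp.sd <- sd(bp)            # standard deviation
bp.se <- bp.sd / sqrt(n)   # standard error of the mean

# 95% confidence interval by the 2-SE rule of thumb
ci.95 <- mean(bp) + c(-2, 2) * bp.se
ci.95

# Plot the mean with SD bars (left) and SE bars (right)
m <- mean(bp)
plot(c(1, 2), c(m, m), xlim = c(0.5, 2.5), ylim = range(bp),
     xaxt = "n", xlab = "", ylab = "Blood pressure (mmHg)", pch = 16)
axis(1, at = c(1, 2), labels = c("mean +/- SD", "mean +/- SE"))
arrows(x0 = c(1, 2), y0 = m - c(bp.sd, bp.se),
       x1 = c(1, 2), y1 = m + c(bp.sd, bp.se),
       angle = 90, code = 3, length = 0.1)
```

The SE bars are narrower than the SD bars because SE describes the uncertainty of the mean, while SD describes the spread of individual measurements.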
Assignment #6. Due 3/29 (Finalized) 

A study of expression levels of human genes

March 29. Hypothesis Testing
 In-Class exercise 3. Hypothesis testing through simulation
# coin-flipping experiments
runif(1) # take one random number between 0 and 1, uniformly distributed
rbinom(n = 1, size = 100, prob = 0.5) # flip 100 (size) fair (prob = 0.5) coins, one (n = 1) time
rbinom(n = 1000, size = 100, prob = 0.5) # repeat the above 1000 times
num.success <- rbinom(n = 1000, size = 100, prob = 0.5) # save the results
barplot(table(num.success)) # distribution of number of successes
length(which(num.success<=40))/1000 # probability of 40 or fewer successes
# test if toads are right-handed: observation is 14/18 right-handed
right.handed.toads.by.chance <- rbinom(n = 1000, size = 18, prob = 0.5) # null distribution, 1000 times
barplot(table(right.handed.toads.by.chance)) # plot
length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal to or higher than 14
# If the observation is 10/18
right.handed.toads.by.chance <- rbinom(n = 1e6, size = 18, prob = 0.5)
length(which(right.handed.toads.by.chance <= 8 | right.handed.toads.by.chance >= 10))/1e6 # two-tailed: at least as far from 9 as the observed 10
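The simulated tail probability for the toad example can be checked against the exact binomial probability. The sketch below uses base R's pbinom() and binom.test(); the variable names and the use of set.seed() are illustrative choices, not part of the exercise.

```r
# Exact one-tailed probability of 14 or more right-handed toads out of 18,
# under the null hypothesis of no handedness (prob = 0.5)
p.exact <- pbinom(13, size = 18, prob = 0.5, lower.tail = FALSE)  # P(X >= 14)
p.exact  # equals 4048 / 2^18, about 0.0154

# Simulated estimate, as in the exercise; set.seed() only makes it reproducible
set.seed(1)
sim <- rbinom(n = 1e5, size = 18, prob = 0.5)
p.sim <- mean(sim >= 14)  # proportion of simulated experiments with >= 14
p.sim

# Two-sided exact binomial test from base R
p.two <- binom.test(x = 14, n = 18, p = 0.5)$p.value
p.two  # twice the one-tailed value, because the null distribution is symmetric
```

The simulated estimate should approach the exact value as the number of simulated experiments increases.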
 Lecture Slides File:Part2distribution.pptx
Assignment #7. Due 4/5 (Finalized)  


April 5. Exam 2 (Open Book)
April 12. Analyzing Proportions
Assignment #8. Due 4/19 (Finalized) 


April 19. Contingency Analysis
Lecture Slides: File:Part3frequency.pptx
Assignment #9. Due 5/3 (Finalized)  

The following table shows results of genotype counts in "Taster" and "non-Taster" individuals.

April 26. No Class (Spring break)
May 3. Exam 3 (Open Book)
 Review lecture slides (part 3)
 Review 2 previous exams
May 10. One-sample t-test
 Motivating example: weights of ticks
 Import data set: http://diverge.hunter.cuny.edu/~weigang/datasetsforbiostat/tick.tsv
 How to visualize data? What kinds of plots to make?
 Is there sexual dimorphism?
 No homework assignment (to be combined with the next one)
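As a rough illustration of what a one-sample t-test computes, the sketch below compares a by-hand calculation with base R's t.test(). The data are simulated and the null mean mu0 is arbitrary; neither comes from the tick data set, whose column names are not given here.

```r
# Hypothetical data: 30 simulated weights (NOT the tick data set)
set.seed(42)  # only to make the run reproducible
weight <- rnorm(30, mean = 2.3, sd = 0.5)
mu0 <- 2.0    # hypothetical mean under the null hypothesis

# By hand: t = (sample mean - null mean) / standard error of the mean
n      <- length(weight)
t.stat <- (mean(weight) - mu0) / (sd(weight) / sqrt(n))
p.val  <- 2 * pt(abs(t.stat), df = n - 1, lower.tail = FALSE)  # two-tailed

# The same test with the built-in function
res <- t.test(weight, mu = mu0)
res$statistic  # matches t.stat
res$p.value    # matches p.val
```

With real data, a histogram or box plot of the variable is a useful first check that the normality assumption is reasonable.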
May 17. Paired & Two-sample t-tests
 Import data set: http://diverge.hunter.cuny.edu/~weigang/datasetsforbiostat/ahus.csv
 PLEASE FILL IN TEACHER EVALUATIONS: Teacher's evaluation
 All lecture slides
 Part 1. File:Biostatpart1.pdf
 Part 2. File:Part2distribution.pptx
 Part 3. File:Part3frequency.pptx
 Part 4. File:Part4ttests.pptx
Assignment #10. Due 5/24 (Finalized)  

