Monte Carlo Club and Southwest-University: Difference between pages

From QiuLab
(Difference between pages)
Jump to navigation Jump to search
imported>Weigang
mNo edit summary
 
imported>Weigang
 
Line 1: Line 1:
__FORCETOC__
<center>'''Biomedical Genomics'''</center>
==(This week) Feb 24, 2017 "Idiots" (Due 3/3/2017)==
<center>July 8-19, 2019</center>
* Source: Paul Nahin (2000), "Dueling Idiots". Problem #2: "When Idiots Duel"
<center>'''Instructor:''' Weigang Qiu, Ph.D.<br>Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center<br>Adjunct Faculty, Department of Physiology and Biophysics
* Game: Idiots A and B decide to duel with a gun in the following way: They will insert a single bullet into the gun's cylinder. The cylinder has a total of 6 slots. Idiot A will spin the cylinder and shoot at B. If the gun doesn't fire, then A will give the gun to B, who will spin the cylinder and then shoot at A. This back-and-forth duel will continue until one fool shoots (and kills) the other.
Institute for Computational Biomedicine, Weil Cornell Medical College</center>
* Questions: (1) what is the probability that A will win (and B dies); (2) What is the average number of trigger pulls before someone dies?
<center>'''Office:''' B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA</center>
<center>'''Email:''' weigang@genectr.hunter.cuny.edu</center>
<center>'''Lab Website:''' http://diverge.hunter.cuny.edu/labwiki/</center>
----
[[File:Lp54-gain-loss.png|400px|thumbnail|Figure 1. Gains & losses of host-defense genes among Lyme pathogen genomes (Qiu & Martin 2014)]]
==Course Overview==
Welcome to BioMedical Genomics, a computer workshop for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and next-generation DNA -sequencing technologies, biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field.


==(Last week) Feb 17, 2017 "Birthday" (Due 2/24/2017)==
Genome information is revolutionizing virtually all aspects of life sciences including basic research, medicine, and agriculture. Meanwhile, use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, as well as data analysis.
[[File:B-day.png|thumbnail]]
* Problem: What is the probability NONE of the N people in a room sharing a birthday?
# Randomly select N individuals and record their B-days
# Count the B-days NOT shared by ANY two individuals
# Repeat (for each N) 1000 times, and obtain probability by averaging the previous counts (i.e., divided by 1000)
# Vary N from 10 to 100, increment by 10
# Plot probability of no-shared B-Day (Y-axis) versus N (x-axis), with either a stripchart or boxplot, or both
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Submitted Codes
|- style="background-color:powderblue;"
|
* By weigang
<syntaxhighlight lang="bash">
days <- 1:365;


find.overlap <- function(x) { return(length(which(table(x)>1))) }
This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises, using published studies.


output <- sapply(1:100, function(x) { # num of people in the room
The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.
  ct.no.overlap <- 0;
  for (k in 1:100) {
    bdays <- sample(days, x, replace = T);
    ct <- find.overlap(bdays);
    if (!ct) { ct.no.overlap <- ct.no.overlap + 1}
  }
  return(ct.no.overlap);
})
plot(1:100, output/100, xlab="Group size", ylab = "Prob (no shared b-day)", las=1, main="B-day (sim=100 times)", type="l")
abline(h=0.5, lwd=2, col=2, lty=2)
abline(v=seq(0, 100, 5), col="gray")
</syntaxhighlight>
|}


==(Previous week) Feb 10, 2017 "Dating" (Valentine's Day Special; Due 2/17/2017)==
==Learning goals==
[[File:Dating.png|thumbnail]]
By the end of this course successful students will be able to:  
* Source: Paul Nahin (2008), "Digital Dice". Problem #20: "An Optimal Stopping Problem"
* Describe next-generation sequencing (NGS) technologies & contrast it with traditional Sanger sequencing
* Problem: What is the optimal time point when one should stop dating more people and settle on a mate choice (and live with the decision)
* Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
* Your best strategy is to date an initial sample of N individuals, rejecting all, and marry the next one ranked higher than any of your N individuals. The question is what is the optimal number for N.
* Visualize and explore genomics data using RStudio
# The problem could be investigated by simulating a pool of 10 individuals, ranked from 1-10 (most desirable being 1) and then take a sample of N
* Replicate key results using a raw data set produced by a primary research paper
# You may only date one individual at a time
# You cannot go back to reach previously rejected candidates
# Simulate N from 0 to 9 (zero means marrying the first date, a sample size of zero)
# For each N, obtain the probability of finding the perfect mate (i.e., ranked 1st) by running simulation 1000 times
# Plot barplot of probability versus sample size N.
# Expected answer: N=4
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Submitted Codes
|- style="background-color:powderblue;"
|
* By Weigang
<syntaxhighlight lang="bash">
pick.candidate <- function(min, array) {
  for (i in 1:length(array)) {
    if (array[i] < min) {
      return(array[i])
    } else {next}
  }
  return(0) # No 1. has been sampled and rejected
}
candidates <- 1:10;
output <- sapply(0:9, function(x) {
  ct <- 0;
  for(k in 1:1000) {
    if (x==0) { # no sample, marry the 1st guy
      sampled <- sample(candidates, 1);
      if (sampled == 1) {ct <- ct+1}
    } else {
      sampled <- sample(candidates, x);
      not.sampled <- candidates[-sampled];
      not.sampled <- sample(not.sampled);
      if (pick.candidate(min = min(sampled), array = not.sampled) == 1) {ct <- ct+1}
    }
  }
  return(ct);
})
barplot(output/1e3, names.arg = 0:9, xlab = "number of sampled dates", las=1, main = "Optimal stopping for dating (N=10 candidates)", ylab = "Prob(marrying No.1)")
</syntaxhighlight>
|}
==(Previous week) Feb 3, 2017 "US Presidents" (Due 2/10/2017)==
[[File:Sim-presidents.png|thumbnail]]
* Download [[File:Presidents.txt|thumbnail]]: 1st column is the order, 2nd column is the name, the 3rd column is the year of inauguration; tab-separated
* Your job is to create an R, Perl, or Python script called “us-presidents”, which will
# Read the table
# Store the original/correct order
# Shuffle/permute the rows and record the new order
# Count the number of matching orders
# Repeat Steps 3-4 for a 1000 times
# Plot histogram or barplot (better) to show distribution of matching counts
# Hint: For R, use the sample() function. For Perl, use the rand() function.
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Submitted Codes
|- style="background-color:powderblue;"
|
* By Mei
<syntaxhighlight lang="bash">
pres.list <- lapply(1:1000, function(x) pres[sample(nrow(pres)),])
cts <- sapply(pres.list, function(x) {
  ct.match <- 0;
  for (i in 1:45){
    if (pres$order[i] == x[i,1]){
      #cat(as.character(x[i,2]), x[i,1], "\n")  #as.character to avoid the factor info
      ct.match <- ct.match + 1;
      #cat(ct.match)
    }
  }
  ct.match #return
})
barplot(table(cts),xlab = "Number of matches per shuffle", border = "hotpink", col = "pink", ylab = paste("Frequency total of:",length(pres.list)), main = "US Presidents")
</syntaxhighlight>
* By John
<syntaxhighlight lang="python">
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt


df = pd.read_table("presidents.txt", names=["num", "name", "presidency"])
==Web Links==
* Install R and R Studio
* Unix Tutorial
* Textbook


name_list = list(df.name)
==Quizzes and Exams==
Student performance will be evaluated by attendance, three (4) quizzes and a final report:
* Attendance: 50 pts
* Quizzes: 2 x 25 pts = 50 pts
* Mid-term: 50 pts
* Final presentation: 50 pts
Total: 200 pts


# create a list to store matches after each shuffle
==Course Schedule==
shuffle_record = []
{| class="wikitable"
|-
! Date & Hour !! Tutorial & Lecture !! Paper !! Quiz & Exam
|-
| July 8 (Mon), 8:40-12:10 || Introduction; R Tutorial I; NGS || NGS ||
|-
| July 9 (Tu), 8:40-12:10 || NGS; R Tutorial II || NGS ||
|-
| July 10 (Wed), 8:40-12:10 || Microbial ecology I; R Tutorial III || Fish diet || Quiz I
|-
| July 11 (Thur), 8:40-12:10 || Microbial ecology II; R Tutorial IV || Lyme OspC (cont'd) ||
|-
| July 12 (Fri), 8:40-12:10 || Functional genomics I || Essential genes (RNA-SEQ/CHIP-SEQ) || Mid-term Exam
|-
| July 15 (Mon), 8:00-12:10 || Functional genomics II || Cancer proteomics ||
|-
| July 16 (Tu), 8:00-12:10 || Genome variation I || TB || Quiz II
|-
| July 17 (Wed), 8:00-12:10 || Genome variation II || Human ||
|-
| July 18 (Thur), 8:00-12:10 || Human genomic variations || Paper 5 (cont'd) ||
|-
| July 19 (Fri), 8:00-12:10|| Example || Example || Presentations
|}


# create a function in which the first argument is the original dataset; second argument is number of shuffles
==Papers & Datasets==
def shuffler2(original, n):
{| class="wikitable sortable"
    record = []
|-
    for i in range(n):
! Omics Application !! Paper link !! Data set !! NGS Technology
        num = 0
|-
        each_shuffle = {}
| Microbiome || [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0193652 Rimoldi_etal_2018_PlosOne] || [https://doi.org/10.1371/journal.pone.0193652.s004 S1 Dataset] || 16S rDNA amplicon sequencing
        temp = original.reindex(np.random.permutation(original.index)) # do shuffling for each
|-
        compare = original.num == temp.num
| Transcriptome || [https://science.sciencemag.org/content/350/6264/1096 Wang_etal_2015_Science] || Tables S2 & S4 || RNA-Seq
        matched_df = original.ix[compare[compare == True].index]
|-
        for i in matched_df.index:
| Transcriptome & Regulome || [https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-019-0477-8 Nava_etal_2019_BMCGenomics] || Tables S2 & S3 || RNA-Seq & CHIP-Seq
            each_shuffle[i] = matched_df.name[i]
|-
        shuffle_record.append(each_shuffle)
| Proteome || [https://www.ncbi.nlm.nih.gov/pubmed/28232952 Qiu_etal_2017_NPJ] || (to be posted) || SILAC
        try:
|-
            num = compare.value_counts()[1]
| Population genomics (Lyme) || [https://jcm.asm.org/content/56/11/e00940-18.long Di_etal_2018_JCM] || [https://github.com/weigangq/ocseq Data & R codes] || Amplicon sequencing (antigen locus)
        except:
|-
            pass
| Population genomics/GWAS (Human) || [https://science.sciencemag.org/content/351/6274/737.long Simonti_etal_2016_Science] || [https://science.sciencemag.org/highwire/filestream/673591/field_highwire_adjunct_files/1/aad2149-Simonti-SM.Table.S2.xlsx Table S2] || whole-genome sequencing (WGS); [http://www.internationalgenome.org/ 1000 Genome Project (IGSR)]
        record.append(num)
|-
    return(record)
| TB surveillance || [https://jcm.asm.org/content/53/7/2230 Brow_etal_2015] || [https://www.ebi.ac.uk/ena/data/view/PRJEB9206 Sequence Archives]|| Whole-genome sequencing (WGS)
result2 = shuffler2(df, 1000)
|-
 
| Example || Example || Example || Example
plt.hist(result2, color="yellow")
|-
plt.title("Histgram for President Data")
| Example || Example || Example || Example
plt.show()
|-
</syntaxhighlight>
| Example || Example || Example || Example
* By Weigang
<syntaxhighlight lang="bash">
p <- read.table("Presidents.txt", sep="\t", header=F)
colnames(p) <- c("order", "name", "inaug.year")
# Use "sapply" or "lapply" for loops: no need to pre-define a vector to store results
p.sim <- sapply(1:10000, function (x) {
  length(which(sample(p$order) == p$order))
  })
barplot(table(p.sim)/1e4, las=1)
p.exp <- rpois(1e4, 1) # draw 10000 Poisson random deviates
mp <- barplot(table(p.exp)/1e4, las=1, xlab = "Num of matching presidents") # mid-point on x-axis
lines(mp[1:7], table(p.sim)/1e4, type="b", col=2) # add a line (Poisson-expected) to the barplot (simulated)
legend("topright", c("Simulated", "Poisson expectation"), col=1:2, lty=1)
</syntaxhighlight>
|}
|}
==(Future weeks) March 3, 2017 (Due March 10, 2017) Pi & E===

Revision as of 01:59, 23 June 2019

Biomedical Genomics
July 8-19, 2019
Instructor: Weigang Qiu, Ph.D.
Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center
Adjunct Faculty, Department of Physiology and Biophysics Institute for Computational Biomedicine, Weil Cornell Medical College
Office: B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA
Email: weigang@genectr.hunter.cuny.edu
Lab Website: http://diverge.hunter.cuny.edu/labwiki/

Figure 1. Gains & losses of host-defense genes among Lyme pathogen genomes (Qiu & Martin 2014)

Course Overview

Welcome to BioMedical Genomics, a computer workshop for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and next-generation DNA -sequencing technologies, biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field.

Genome information is revolutionizing virtually all aspects of life sciences including basic research, medicine, and agriculture. Meanwhile, use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, as well as data analysis.

This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises, using published studies.

The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.

Learning goals

By the end of this course successful students will be able to:

  • Describe next-generation sequencing (NGS) technologies & contrast it with traditional Sanger sequencing
  • Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
  • Visualize and explore genomics data using RStudio
  • Replicate key results using a raw data set produced by a primary research paper

Web Links

  • Install R and R Studio
  • Unix Tutorial
  • Textbook

Quizzes and Exams

Student performance will be evaluated by attendance, three (4) quizzes and a final report:

  • Attendance: 50 pts
  • Quizzes: 2 x 25 pts = 50 pts
  • Mid-term: 50 pts
  • Final presentation: 50 pts

Total: 200 pts

Course Schedule

Date & Hour Tutorial & Lecture Paper Quiz & Exam
July 8 (Mon), 8:40-12:10 Introduction; R Tutorial I; NGS NGS
July 9 (Tu), 8:40-12:10 NGS; R Tutorial II NGS
July 10 (Wed), 8:40-12:10 Microbial ecology I; R Tutorial III Fish diet Quiz I
July 11 (Thur), 8:40-12:10 Microbial ecology II; R Tutorial IV Lyme OspC (cont'd)
July 12 (Fri), 8:40-12:10 Functional genomics I Essential genes (RNA-SEQ/CHIP-SEQ) Mid-term Exam
July 15 (Mon), 8:00-12:10 Functional genomics II Cancer proteomics
July 16 (Tu), 8:00-12:10 Genome variation I TB Quiz II
July 17 (Wed), 8:00-12:10 Genome variation II Human
July 18 (Thur), 8:00-12:10 Human genomic variations Paper 5 (cont'd)
July 19 (Fri), 8:00-12:10 Example Example Presentations

Papers & Datasets

Omics Application Paper link Data set NGS Technology
Microbiome Rimoldi_etal_2018_PlosOne S1 Dataset 16S rDNA amplicon sequencing
Transcriptome Wang_etal_2015_Science Tables S2 & S4 RNA-Seq
Transcriptome & Regulome Nava_etal_2019_BMCGenomics Tables S2 & S3 RNA-Seq & CHIP-Seq
Proteome Qiu_etal_2017_NPJ (to be posted) SILAC
Population genomics (Lyme) Di_etal_2018_JCM Data & R codes Amplicon sequencing (antigen locus)
Population genomics/GWAS (Human) Simonti_etal_2016_Science Table S2 whole-genome sequencing (WGS); 1000 Genome Project (IGSR)
TB surveillance Brow_etal_2015 Sequence Archives Whole-genome sequencing (WGS)
Example Example Example Example
Example Example Example Example
Example Example Example Example