Biol20N02 2017 and Monte Carlo Club: Difference between pages

From QiuLab
(Difference between pages)
Jump to navigation Jump to search
imported>Weigang
 
imported>Weigang
 
Line 1: Line 1:
<center>'''Analysis of Biological Data''' (BIOL 20N02, Spring 2017)</center>
__FORCETOC__
<center>'''Instructor:''' Dr Weigang Qiu, Associate Professor, Department of Biological Sciences </center>
=Season V. Genes, Memes, and Machines (Spring 2022)=
<center>'''Room:''' 1001B HN (North Building, 10th Floor, Mac Computer Lab)</center>
* A journal club to continue the exploration of the link between evolution & learning.
<center>'''Hours:''' Thursdays 11:10-1:40</center>
* [https://pages.cs.wisc.edu/~dyer/cs540/handouts/info-theory-primer.pdf A primer for information theory]
<center>'''Office Hours:''' Belfer Research Building ([https://www.google.com/maps/place/413+E+69th+St,+New+York,+NY+10021/@40.7655886,-73.9561743,17z/data=!3m1!4b1!4m2!3m1!1s0x89c258c3d235f76f:0x4f3d0d5d8a78fe6?hl=en Google Map]) BB-402; Tuesdays 5-7 pm or by appointment</center>
* Data ≠ Information: Scharf (2021). The Ascent of Information: Books, Bits, Genes, Machines, and Life's Unending Algorithms. [https://www.amazon.com/Ascent-Information-Machines-Unending-Algorithm/dp/0593087240 Amazon link]
<center>'''Course Website:''' http://diverge.hunter.cuny.edu/labwiki/Biol20N2_2017</center>
* The "It's the song not the singer" (ITSNTS) theory of selection unit: [https://pubmed.ncbi.nlm.nih.gov/29581311/ Doolittle & Inkpen (2018).] "Processes and patterns of interaction as units of selection: An introduction to ITSNTS thinking", PNAS.
----
* Evolution is an AI machine: [https://www.oreilly.com/radar/open-endedness-the-last-grand-challenge-youve-never-heard-of/ Stanley, Lehman, and Soros (2017).] "Open-endedness: The last grand challenge you’ve never heard of - While open-endedness could be a force for discovering intelligence, it could also be a component of AI itself." O'Reily
==Course Description==
* An algorithmic definition of individuality: [https://pubmed.ncbi.nlm.nih.gov/32212028/ Krakauer et al (2020)]. "The information theory of individuality". Theory BioSci
With rapid accumulation of genome sequences and digital health data, biomedicine is becoming an information science.  This course is a hands-on, computer-based workshop on how to visualize and analyze biological data. The course introduces R, a modern statistical computing language and platform. In the first half, students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data-visualization techniques. In the second half, the course will review & apply statistical hypothesis tests including significance testing of means, association tests, and correlation analysis. Throughout the course, students will apply these methods to the analysis of large biological data sets, such as the human genome, transcriptomes (RNA-SEQ), and human genome variations.
* Evolution towards complexification: [https://aip.scitation.org/doi/abs/10.1063/1.3643064 Krakauer (2011)]. "Darwinian demons, evolutionary complexity, and information maximization". Chaos.


This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter pre-requisites are BIOL100, BIOL102 and STAT113.
=Season IV. Classification using Machine Learning (Summer 2021)=
* Textbook: Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition. [https://www.amazon.com/gp/product/1492032646/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1 Amazon link]
* Set up Python work environment with [https://docs.conda.io/en/latest/miniconda.html mini-conda]
* We will be using Jupyter-notebook (part of mini-codon installation) to share codes
==Week 1. MNIST dataset (Chapter 3)==
# Dataset loading and display (pg 85-87): Jackie, Niemah, and Hannah
# Binary classifier
## Cross-validation (pg 89-90): Roman, etc
## Confusion matrix (pg 90-92):
## Precision and recall (pg 92-97)
## ROC curve (pg. 97-100)
# Multiclass classification (pg 100-108): Brian, etc


==Learning Goals==
==Week 2. K-means clustering (Chapter 9)==
* Be able to use R as a plotting tool to visualize large-scale biological data sets
# 2D simulated dataset
* Be able to use R as a statistical tool to summarize data and make biological inferences 
# Image recognition
* Be able to use R as a programming language to automate data analysis
# MNIST dataset
==Week 3. Exercises (pg.275-276, Chapter 9)==
# Exercise 10: Facial recognition (Olivetti faces dataset, with k-means)
# Exercise 11. Facial recognition (semi-supervised learning


==Textbooks==
=Notes on Origin of Life (Spring 2020)=
* Whitlock & Schluter (2015). Analysis of Biological Data. (2nd edition). [http://www.amazon.com/Analysis-Biological-Data-Michael-Whitlock/dp/1936221489/ref=sr_1_1?ie=UTF8&qid=1455723611&sr=8-1&keywords=whitlock+and+schluter%27s+analysis+of+biological+data Amazon link]
# [https://journals.plos.org/plosone/article/metrics?id=10.1371/journal.pone.0224552 Attie et al (2019). Genetic Code optimized as a traveling salesman problem]
# [https://itsatcuny.org/calendar/self-organizing-systems-and-the-origin-of-life GC Origin of Life Seminar (2/19/2021)]
# [https://www.cell.com/current-biology/fulltext/S0960-9822(15)00681-8?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0960982215006818%3Fshowall%3Dtrue Pressman et al (2015). Review: RNA World (ribozyme as origin of genetic code)]
# A perspective: [https://www.nature.com/articles/s42256-020-00278-8 Miikkulainen, R., Forrest, S. A biological perspective on evolutionary computation. Nat Mach Intell 3, 9–15 (2021)]


==Exams & Grading==
=Season III. Summer 2018 (Theme: Evolutionary Computing)=
* Attendance (or a note in case of absence) is required
==Week 1. Introduction & Motivating Examples==
* In-Class Exercises (50 pts).
# Genetic Arts
* Assignments. All assignments should be handed in as hard copies only. Email submission will not be accepted. Late submissions will receive 10% deduction (of the total grade) per day (~100 pts total).
## [http://picbreeder.org Go to picbreeker website] & evolve using "branch" method
* Three Mid-term Exams (3 X 30 pts each = 90 pts)
## Evolve 3D art: [http://endlessforms.com Endless Forms]
* Comprehensive Final Exam (50 pts)
## CPPN-NEAT Algorithm: Compositional Pattern Producing Networks (CPPNs)-NeuroEvolution of Augmenting Topologies (NEAT); [https://www.ncbi.nlm.nih.gov/pubmed/20964537 PicBreeder paper]; or [http://campbellssite.com/papers/secretan_chi08.pdf a PDF version]
* Bonus for active participation in classroom discussions
# [https://www.nature.com/articles/s41586-018-0102-6 NeuroEvolution (by DeepMind team)]
# Robotic snake (and other soft robots)
## A controller built with physics laws difficult (too many parameters)
## Simulation with evolutionary computing:
### (Genotype) A list of 13 commands (one for each segment; each being a neural net, with 25 inputs and one output of joint angles)
### (Phenotype) Fitness function: total displacement
### Algorithm: [https://medium.com/@devonfulcher3/the-map-elites-algorithm-finding-optimality-through-diversity-def6dcbc0f5b MAP-Elites]; [https://www.nature.com/articles/nature14422 Nature paper]
### Approach: training with simulated data
# Robotic Knightfish
## [https://www.youtube.com/watch?v=3XjgZbs0t2g Youtube demo]
## Algorithm: [https://en.wikipedia.org/wiki/CMA-ES CMA-ES (Covariance Matrix Adaptation Evolution Strategy)]
### Genotype: 15 variables (sinusoidal wave function or Fourier Series)
### Phenotype/Fitness: speed
## Implementation: DEAP
# Compositional Protein design (CPD)
## Genotype: side-chain configuration determined by [https://www.ncbi.nlm.nih.gov/pubmed/8464064 rotamer library]
## Phenotype/Fitness: Rosetta energy function & functional (e.g., binding affinity)
## Fitness landscape: each node is a protein structure, each edge represent connections/relatedness
# Simulation-based optimization: [http://simopt.org/ Problem Sets]
# Self-evolving software([http://geneticprogramming.com/ Genetic Programming])


==Course Outline==
==Week 2. Toy Problem: OneMax Optimization==
===Feb 2. Introduction & R Demo===
# Problem: Create a list of L random bits (0 or 1). Evolve the list until the fitness reaches the maximum value (i.e., contains only 1's)
# Course overview
# Neutral Evolution:
# Tutorial 1: R Demo
## Create a vector of L random bits (e.g, L=20). Hint for creating a random vector of 0's and 1's in R: <code>ind<-sample(c(0,1), prob=c(0.5.0.5), replace=T, size=20)</code>. Biologically, this vector represents a single haploid genome with 20 loci, each with two possible alleles (0 or 1).
* Create a new project by navigating: File | New Project | New Directory. Name it project file "human_genes"
## Create a population of N=100 such individuals. Hint: creating a list of vectors in R: <code>pop <- lapply(1:100, function(x){<insert sample() function above>})</code>
* Import the human genes data set: File | Import DataSet | CSV, copy & paste this address: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2
## For each generation, each individual reproduces 10 gametes with mutation (with the probability for bit flip: mu = 1/L = 0.1). Hint: write a mutation function that takes an individual as input and outputs a mutated gamete. Use the "broken stick" algorithm to implement mutation rate: <code>cutoff <- runif(1); ifelse(cutoff <= mu, flip-bits, no-flip)</code>
* Click "update". Rename the data set if you wish (short but informative names, e.g., hg, or human.genes). Do not use spaces, use dot or underscore as name delimiters (e.g., "human.genes" or "human_genes", but never "human genes") Same rule for column or row names
## Take a random sample of N=100 gametes into the next generation & repeat above
## Plot mean, minimum, and maximum fitness over generations
## Plot diversity statistics over generation (including average allelic heterozygosity per locus as well as haplotype heterozygosity). Hint: write two functions for these heterozygosity
# Add natural selection (proportional scheme)
## Individuals reproduce with fitness proportional to the total number of 1's.
## Iterate until a population contains at least one individual with all 1's
## Plot mean, minimum, and maximum fitness over generations
## Plot diversity (expectation: decreasing)
# Add natural selection (tournament scheme)
## Randomly selecting n=5 individuals and allow the fittest one to make gametes
## Plot mean, minimum, and maximum fitness over generations (Expectation: faster optimization)
## Plot diversity (expectation: decreasing fasters)
# Add crossover
## Hint: write a crossover function
## Does it reach optimization faster?
# Code submissions
## [https://github.com/JohnDi0505/MonteCarlo_Simulation-Biostats/blob/master/OneMax%20Optimization.ipynb Python Notebook by John]
## [http://rpubs.com/weigang/407979 R code by Weigang]
## [http://rpubs.com/ChrisNSP/408534 R code by Panlasigui]
## [https://github.com/Inhenn/Toy_Problem_Bio/blob/master/Bio_Toy.ipynb Python Notebook by Yinheng]
## [http://rpubs.com/desiree/409200 RPub by Desiree]
==Week 3. Python & R Packages for evolutionary computation==
* [http://deap.readthedocs.io/en/master/ DEAP: a Python package for Evolutionary Computing]
** Look under [http://deap.readthedocs.io/en/master/examples/index.html Examples] to repeat the OneMax code
* [https://cran.r-project.org/web/packages/GA/vignettes/GA.html GA: An R package for genetic programming]
** Reference 1 (Examples in Section 4). [https://www.jstatsoft.org/v53/i04/ Scrucca, L. (2013) GA: A Package for Genetic Algorithms in R.]
** Reference 2 (for Advanced applications). [https://journal.r-project.org/archive/2017/RJ-2017-008 Scrucca, L. (2017) On some extensions to GA package: hybrid optimisation, parallelisation and islands evolution. ]
** OneMax code & plots (Weigang)
** Example 4.1a. One variable optimization: f(x) = |x| + cos(x) (Muhammad and Desiree)
** Example 4.1b. One variable optimization: f(x) = (x2 + x) cos(x)
** Example 4.2.  Two-parameter optimization: f(x;y) = 20 + x^2 + y^2 * 10(cos(2x) + cos(2y)) (Muhammad and Desiree)
** Example 4.3.  Curve-fitting: tree growth
** Example 4.7.  Constrained optimization: Knapsack Problem
** Example 4.8.  Combinatorial optimization: Traveling Salesman Problem (Brian)
** Advanced application 1. Stock portfolio (Hybrid algorithms)
** Advanced application 2. Parallelization
** Advanced application 3. Island model
* Code submissions
** [http://rpubs.com/weigang/410426 rPub for OneMax (by Weigang)]
** One & Two-dimensional functional optimization with GA (by Desiree & Mohamud) : [http://rpubs.com/desireepante/411065 Entropy function]
** [https://github.com/Inhenn/Knapsack-Problem-Using-DEAP/blob/master/DEAP3.ipynb OneMax and Knapsack Problem with DEAP (by Yinheng)]
** Traveling Salesman Problem with GA (by Brian)
 
==Week 4. Multiplex Problem==
==Week 5. Genetic Programming==
 
=Season II. Summer 2017 (Theme: Machine Learning)=
==Week 1. Introduction & the backprop algorithm==
[[File:Iris-box3.png|thumbnail]] [[File:Iris-box4.png|thumbnail]]
# Problem: Classification/Clustering/Predication of flower species (a total of 3 possible species in the sample data set) based on four phenotypic traits/measurements
# The "iris" data set: exploratory data analysis with visualization & descriptive statistics: <code>data("iris"); plot()</code>, <code>summary()</code>, <code>qqnorm(); qqline(); hist()</code>; normalization with <code>scale()</code> (Roy)
# Mathematics of backpropagating errors to neural connections (Oliver)
## Objective/optimization function measuring difference between target (<code>t</code>, expected) and neural activity <code>y</code>: <code>G=Sum{t*log(y)+(1-t)log(1-y)}</code> (this is known as the "cross-entropy" error function; the other alternative is "minimal squared error (MSE)"), which has the gradient in the simple form of <code>g=-(t-y)x, where x is the input</code>. The objective function is minimized when weights are updated by the gradient. Error-minimization by MSE has similar effects but harder to calculate.
## Learning algorithm is presented
 
==Week 2. Traditional approaches to multivariate clustering/classification==
# Dimension reduction with Multidimensional Scaling <code>cmdscale()</code> [http://rpubs.com/meibyderp/281616 Mei's rNoteBook].
# Dimension reduction with Principal Component Analysis<code>princomp()</code>. [http://rpubs.com/ssipa/281715 Sipa's rNoteBook]
# Multivariate clustering with Hierarchical Clustering <code>hclust()</code> (Saymon)
# Multivariate clustering with k-means <code>kmeans()</code> [http://rpubs.com/roynunez/281963 Roy's rNoteBook]
# Classification based on logistic regression <code>glm()</code>, linear discriminatory analysis <code>lda()</code>, & k-nearest neighbor <code>library(class); knn()</code> (John)
# Modern, non-linear classifiers: Decision Trees (DT), Support Vector Machines (SVM), and Artificial Neural Networks (ANN) (Brian)
 
==Week 3. Single-neuron classifier==
# Algorithm: [[File:Ml-image-1a.jpg|thumbnail]]
## Read input data with <code>N=150 flowers</code>. Reduce to a matrix <code>x</code> with two traits (use trait 1 & 3) and two species (use rows 51-150, the last two species, skip the first species [easy to separate]) for simplicity. Create a target vector <code>t <- c(rep(0,50),rep(1,50))</code> indicating two species
## Initialize the neuron with two random weights <code>w<-runif(2, 1e-3, 1e-2)</code> and one random bias <code>b<-runif(1)</code>
## Neuron activation with <code>k=2</code> connection weights: <code>a=sum(x[k] * w[k])</code>
## Neuron activity/output: <code>y=1/(1+exp(-a-b))</code> (This logistic function ensures output values are between zero and one)
## Learning rules: learning rate <code>eta=0.1</code>, backpropagate error ("e") to get two updated weights (for individual <code>i</code>, feature <code>k</code>): <code>e[i]=t[i]-y[i]; g[k,i]= -e[i] * x[k,i]; g.bias[i] = -e[i] for bias</code>; Batch update weights & bias: <code>w[k]=w[k] - eta * sum(g[k,i]); b = b - eta * sum(g.bias[i])</code> (same rule for the bias parameter <code>b</code>); Repeat/update for <code>L=1000 epochs (or generations)</code>
## Output: weights and errors for each epoch
# Evaluation:
## Plot changes of weights over epoch
## Use the last weights and bias to predict species
## Compare prediction with target to get accuracy
## Plot scatter plot (x2 vs x1) and add a line using the weights & bias at epoch=1,50,100,200,500, 1000: <code>a=w1*x1 + w2*x2 + b; a=0</code>. The lines should show increasing separation of the two species.
# Code submissions
## Roy: [http://rpubs.com/roynunez/283699 rPubs notebook]
## John: [https://github.com/JohnDi0505/MonteCarlo_Simulation-Biostats/blob/master/Neuron_Networks/Single_Neuron_Classification.py Python code on github]
## Mei: [http://rpubs.com/meibyderp/284498 rPubs notebook]
## Brian: [http://rpubs.com/drtwisto/286938 rPubs notebook]
## Weigang: [http://rpubs.com/weigang/279898 rPubs notebook]
# Questions for future exploration
## How to avoid over-fitting by regularization (a way to penalize model complexity by adding a weight decay parameter <code>alpha</code>)
## Bayesian confidence interval of predictions (with Monte Carlo simulation)
## Limitations: equivalent to PCA (linear combination of features); adding a hidden layer generalize the neural net to be a non-linear classifier
 
==Week 4. Single-layer, multiple-neuron learner==
[[File:Multiple-neuron-learner.png|thumbnail]]
# Predict all three species (with "one-hot" coding) using all four features: e.g., 100 for species 1, 010 for species 2, and 001 for species 3. Create 3 neurons, each one outputting one digit.
# Use three neurons, each accepts 4 inputs and output 1 activity
# Use softmax to normalize the final three output activities
# Code submissions
## John:
### [http://www.kdnuggets.com/2016/07/softmax-regression-related-logistic-regression.html reference this webpage]
### [https://github.com/JohnDi0505/MonteCarlo_Simulation-Biostats/blob/master/Neuron_Networks/Multiple_Neuron_Classification.py Python code]
### TensorFlow code
## Roy: [http://rpubs.com/roynunez/288023 rpubs notebook]
## Mei: [http://rpubs.com/meibyderp/288400 rPubs Notebook]
## Weigang: [http://rpubs.com/weigang/287025 rPubs notebook]
# Challenge: Implementation with TensorFlow, an Open Source python machine learning library by Google Deep Mind team. Follow [https://www.tensorflow.org/get_started/mnist/beginners this example of softmax regression]
# A biological application for the summer: identify SNPs associated with biofilm/swarming behavior in Pseudomonas
 
==Week 5. Multi-layer ("Deep") Neuron Network==
[[File:Iris-nnet.png|thumbnail]]
[[File:Nnet-overfitting.png|thumbnail|An investigation of under-fitting (at N=1) & over-fitting (at N=3 nodes). N=0 implies linear fitting.]]
# Input layer: 4 nodes (one for each feature)
# Output layer: 3 nodes (one for each species)
# Hidden layer: 2 hidden nodes
# Advantage: allows non-linear classification; using hidden layers ("convoluted neural net") is able to capture high-order patterns
# Implementation: I'm not going to hard-code the algorithm from scratch. The figure was produced using the R package with the following code: <code>library(nnet); library(neuralnet); targets.nn <- class.ind(c(rep("setosa",50), rep("versicolor",50), rep("virginica",50))) # 1-of-N encoding; iris.net <- neuralnet(formula = setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = training.iris, hidden = 2, threshold = 0.01, linear.output = T); plot(iris.net, rep="best")</code>. It achieved 98.0% accuracy.
# Deep neural net allows non-linear, better fitting, but we don't want over-fitting by adding more hidden layers (or more neurons in the hidden layer). Identify under- and over-fitting with the following procedure:
## Randomly sample 100 as training set and the remaining as target. Repeat 100 times
## For each sample, plot accuracy for the training set, as well as accuracy for the target
## Find the point with the right balance of under- and over-fitting
 
==Week 6. Unsupervised Neural Learner: Hopfield Networks==
# Hebbian model of memory formation (MacKay, Chapter 42). Preparatory work:
## Construct four memories, each for a letter ("D", "J", "C", and "M") using a 5-by-5 grid, with "-1" indicating blank space, and "1" indicating a pixel. Flatten the 5-by-5 matrix to a one-dimensional 1-by-25 vector (tensor).
## Write two functions, one to show a letter <code>show.letter(letter.vector)</code>, which draws a pixel art of letters (print a blank if if -1, a "x" if 1), another to mutate the letter <code>mutate(letter.vector, number.pixel.flips)</code>
# [https://en.wikipedia.org/wiki/Hopfield_network Hopfield Network]
## Store the four memories into a weight matrix, which consists of symmetric weights between neurons i and neuron j, i.e., w[i,j] = w[j,i]. We will use a total of 25 neurons, one for each pixel.
## First, combine the four vectors (one for each letter) into a matrix: <code>x <- matrix(c(letter.d, letter.j, letter.c, letter.m), nrow = 4, byrow = T)</code>
## Second, calculate weight matrix by obtaining the outer product of a matrix multiplication: <code>w <- t(x) %*% x</code>. Set diagonal values to be all 0's (i.e., to remove all self-connections): <code>for (i in 1:25) { w[i,i] = 0 }</code>. This implements [https://en.wikipedia.org/wiki/Hebbian_theory Hebb learning], which translates correlations into strength of connections quantified by weights: large positive weights indicate mutual stimulation (e.g., 1 * 1 = 1 [to wire/strengthen the connection of co-firing neurons, and ...], -1 * -1 = 1 [to wire/strengthen the connection of co-inhibitory neurons as well]), large negative weights indicate mutual inhibition (e.g., 1 * -1 = -1 [to unwire/disconnect oppositely activated neurons]), and small weights indicate a weak connection (e.g., 0 * 1 = 0 [to weaken connections between neurons with uncorrelated activities, but do not unwire them]). (0, 1, and -1 being values of neuron activities)
## Third, implement the learning rule by updating activity for each neuron, sequentially (asynchronously): <code>a[i] <- sum(w[i,j] * x[j]) </code>
## Iterate the previous step 1-5 times, your code should be able to (magically) restore the correct letter image even when the letter is mutated by 1-5 mutations in pixels. This exercise simulates the error-correction ability of memory (e.g., self-correcting encoding in CDs, a neon-light sign missing a stroke, or our ability to read/understand/reconstruct texts with typos, e.g., you have no problem reading/understanding this sentence: "It deosn’t mttaer in what order the ltteers in a word are, the olny iprmoatnt thing is taht the frist and lsat ltteer be in the rghit pclae.").  
## Expected capacity of a Hopfield Network: <code>number_of_memories = 0.138 * number_of_neurons</code>. In our case, 25 (neurons) *0.138 = 3.45 memorized letters. So the network/memory fails if we squeeze in one additional letter (try it!).
# Code submissions
## Roy: [http://rpubs.com/roynunez/288433 rPubs Notebook]
## John: [https://github.com/JohnDi0505/MonteCarlo_Simulation-Biostats/blob/master/Neuron_Networks/Unsupervised_ML_Hopfield_Networks.ipynb Python Code on Github]
## Weigang: [http://rpubs.com/weigang/281806 rPubs Notebook]
# Potential biological applications: genetic code optimization; gene family identification
 
==Week 7. Biological Applications==
A conceptual map for choosing ML algorithms: [[File:Ml map.png|thumbnail|Conceptual map from SciKit-Learn website]]
* The Python SciKit Learn framework may be tool of choice (instead of R). [http://scikit-learn.org/stable/auto_examples/index.html See these nice examples] 
* Bayesian Network (using e.g., R Package <code>bnlearn</code>) to identify cell-signaling networks. Sache et al (2005). [http://science.sciencemag.org/content/308/5721/523.long Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data]
* Identify proteins associated with learning in mice (Classification & Clustering)[https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression UCI Mice Protein Expression Dataset]
* Predict HIV-1 protease cleavage sites (Classification): [https://archive.ics.uci.edu/ml/datasets/HIV-1+protease+cleavage UCI HIV-1 Protease Dataset]
* Lab project 1. Identify SNPs in 50 c-di-GMP pathway genes that are associated with c-di-GMP levels, swarming ability, and biofilm ability in 30 clinical isolates of Pseudomonas aeruginosa. Approach: supervised neural network (with regularization)
* Lab project 2. Identify genetic changes contributing to antibiotic resistance in 3 cancer patients. Approach: whole-genome sequencing followed by statistical analysis
* Lab project 3. Simulated evolution of genetic code. Approach: Multinomial optimization with unsupervised neural network
** An implementation example: [http://www.hoonzis.com/neural-networks-f-xor-classifier-and/ in C# language]
==Summer Project 1. Systems evolution of biofilm/swarming pathway (with Dr Joao Xavier of MSKCC)==
[[File:sim-cor-2.png|thumbnail| <b>Fig.1 .Simulated CDG-correlated SNPs.</b> t-test results: (1) cor=0.8 (strong), t=-5.1142, df=95.596, p=1.619e-06; (2) cor=0.5 (medium), t=-4.7543, df=85.796, p=7.953e-06; (3) cor=0.2 (weak), t=-0.94585, df=79.28, p= 0.3471.]]
[[File:Sim-tree-snp-1.png|thumbnail|<b>Fig.2. Simulated tree-based SNPs.</b> Generated with the APE function: <code>replicate(10, rTraitDisc(tr, states = c(0,1), rate = 100, model = "ER"))</code>]]
# Acknowledgement: NSF Award 1517002
# Explanatory variable (genotypes): Whole-genome data (of ~30 clinical isolates & many experimentally evolved strains) as independent variables
# Explanatory variable (genotypes): ~50 genes related to cyclic-di-GMP synthesis and regulation (~8000 SNPs, ~1700 unique, ~5-8 major groups)
# Response variables (phenotypes): biofilm size, swarming size, antibiotic sensitivity profile, metabolomics measurements, c-di-GMP levels
# Sub-project 1: Database updates (Usmaan, Edgar, Christopher; continuing the work of Rayees and Raymond in previous years)
## c-di-GMP pathway SNPs in the new table "cdg_snp"
## c-di-GMP levels in "phenotype" table
## capture fig orthologs (in progress)
## capture matebolomics data with a new table (in process)
### [http://diverge.hunter.cuny.edu/~weigang/hm_metabolite-dendoclust.html a heatmap of deviation from mean (among strains of the same metabolite)]
### [http://diverge.hunter.cuny.edu/~weigang/matolites-fc-volcano-plot.html a volcano plot (p values based on t-test from group mean among strains of the same metabolite)]
# Sub-project 2. Supervised learning for predicting genes, SNPs associated with phenotypes
## Advantages over traditional regression analysis: ability to discover non-linear correction structure
## Challenges:
### over-fitting (we have only ~30 observations while thousands of SNPs, "curse of dimensionality")
### the "effective" sample size is further discounted/reduced by phylogenetic relatedness among the strains.
## Single neuron, two-levels; by SNP groups (using <code>hclust()</code>); by PCA (linear combination of SNPs); or by linkage groups (looking for homoplasy) (Mei), or by t-SNE (as suggested by Rayees)
## Three neurons, three-levels; by gene (Roy)[http://rpubs.com/roynunez/291340 july-14_nnet_cidigmp.R]
## To Do: (1) predict continuous-value targets; (2) add hidden layers; (3) correct for phylogenetic auto-correlation
# Sub-project 3. Generate simulated data (with Choleski Decomposition) (Weigang; continue the work in previous year by Ishmere, Zwar, & Rayees)
Fig.1. Simulated SNPs associated with cdg levels (w/o tree) [https://stats.stackexchange.com/questions/12857/generate-random-correlated-data-between-a-binary-and-a-continuous-variable inspired by this algorithm]
<syntaxhighlight lang="bash">
# simulate SNP-cdg correlation
r <- 0.2 # desired correlation coefficient
sigma <- matrix(c(1,r,r,1), ncol=2) # var-covariance matrix
s <- chol(sigma) # choleski decomposition
n <- 100 # number of random deviates (data points)
z <- s %*% matrix(rnorm(n*2), nrow=2) # 100 correlated normally distributed deviates with cor(x,y)=r
u <- pnorm(z) # get probabilities for each deviates
snp.states <- qbinom(u[1,], 1, 0.5) # discretize the 1st vector of probabilities into 0/1 with Bernoulli trial
idx.0 <- which(snp.states == 0); # indices for "0"
idx.1 <- which(snp.states == 1); # indices for "1"
# boxplots with stripcharts
boxplot(u[2,] ~ snp.states, main="cor=0.2", xlab="SNP states", ylab="CDG level")
stripchart(u[2,] ~ snp.states, vertical=T, pch=1, method="jitter", col=2,  add=T)
</syntaxhighlight>
Fig.2. Simulated SNPs associated with strain phylogeny
<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
dim(hg) # show dimension
# Simulate SNPs on a tree
head(hg) # show top rows
tr <- read.tree("cdg-tree-mid.dnd") # mid-point rooted tree
tail(hg) # show bottom rows
X <- replicate(10, rTraitDisc(tr, states = c(0,1), rate = 100, model = "ER"))
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # create a vector of gene lengths (Tab for auto-completion)
id <- read.table("cdg.strains.txt3", row.names = 1, sep="\t")
hg.len[1] # show first length
par.tr <- plot(tr, no.margin = T, x.lim = 0.03, show.tip.label = F)
hg.len[10] # show the 10th item
text(rep(0.013,30), 1:30, id[tr$tip.label,1], pos = 4, cex=0.75)
hg.len[1:10] # show items 1 through 10
text(rep(0.02,30), 1:30, snps[tr$tip.label], pos=4, cex=0.75)
hg.len[c(1,10)] # show items 1 and 10
add.scale.bar()
hg[1,1] # show item in 1st row, 1 column
abline(h=1:30, col="gray", lty=2)
hg[1,] # show all values for 1st row
</syntaxhighlight>
hg[,1] # show all values for 1st column
To Do/Challenges: (1) simulate multiple SNPs (linear correlation); (2) simulate epistasis (non-linear correlation); (3) simulate phylogenetic correlation
hg[1:10,] # show rows 1 through 10
* Simulation & xgboost code contributed by Mei & Yinheng (January 2018)
hg[,1:7] # show columns 1 through 10
<syntaxhighlight lang="bash">
hg[1:10, 1:7] # a subset
 
hist(hg$Length, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
#To create a simulated matrix w/ correlated xy variables; NOTE: ONLY 1 correlated x variable
hist(log10(hg$Length), br = 200) # log transformation for non-normally distributed variable
get1Simulated <- function(corco, nstrains, snps){
gene.cts <- table(hg$Chromosome) # count number of genes, separated by chromosomes
  r <- corco/10 # desired correlation coefficient
barplot(gene.cts) # show distribution by a barplot
  sigma <- matrix(c(1,r,r,1), ncol=2) # var-covariance matrix
mean(hg$Length) # not representative, super-long genes carry too much weight to the average length
  s <- chol(sigma) #cholesky decomposition
median(hg$Length) # More representative. Use median for a variable not normally distributed
  n <- nstrains
summary(hg$Length) # Show all quartiles
  z <- s %*% matrix(rnorm(n*2), nrow=2)
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # create a new column named "Gene.Length"
  u <- pnorm(z[2,])
boxplot(log10(Length) ~ Chromosome, data = hg) # show gene length by chromosomes
  snp.states <- qbinom(u, 1, 0.5)
write.csv(hg, "hg.csv", row.names = FALSE) # save into a file
 
hg <- read.csv("hg.csv") # read back into R
  known <- t(rbind(z[1,],snp.states))
  rand <- pnorm(matrix(rnorm(nstrains*(snps-1)),nrow = nstrains))
  snp.rand<-qbinom(rand,1,0.5)
  x <- cbind(known, snp.rand)
 
  colnames(x)[1] <- "target"
  colnames(x)[-1] <- paste("feature", seq_len(ncol(x)-1), sep = ".")
 
  return(x)
}
 
#Use this function to artificially create 2 x variables that have relationship w/ Y.
getSimulated.chol <- function(snp1, snp2, nstrains, snps){
  n <- nstrains
 
  btwn <- snp1 * snp2 #this is a suggested value for inter-SNP correlation. the covariance matrix has to be semi-positive infinite in order to carry out the decomposition
 
  covar.m <- matrix(c(1.0, snp1, snp2, snp1, 1.0, btwn, snp2, btwn, 1.0), nrow = 3) #cor-matrix
 
  s <- chol(covar.m)
  z <- s %*% matrix(rnorm(n*3), nrow = 3)
 
  #decretize the 2 variables
  u <- t(pnorm(z[2:nrow(z),]))
  decretize.u <-qbinom(u,1,0.5)
 
  #generate some random discrete variables and combine w/ the 2 correlated variables
  rand <- matrix(rnorm(nstrains*(snps-2)),nrow = nstrains)
  rand.u <- pnorm(rand)
  snp.rand <- qbinom(rand.u,1,0.5)
  x <- cbind(z[1,], decretize.u, snp.rand)
 
  dimnames(x) <- list(c(), c("target", paste("feature", seq_len(ncol(x)-1), sep = ".")))
 
  return(x)
}
 
#This utilizes xgboost function w/o cross-validating the strains and w/ tuned parameters (we think it's the optimal set) tested by Yinheng's python [https://github.com/weigangq/mic-boost/blob/master/micboost/__init__.py getBestParameters] function. Please make sure to have xgboost installed.
runXg <- function(x){
  require(xgboost)
 
  bst <- xgboost(data = x[,2:ncol(x)], label = x[,1], max.depth = 2, eta = .05, gamma = 0.3, nthread = 2, nround = 10, verbose = 0, eval_metric = "rmse")
 
  importance_matrix <- xgb.importance(model = bst, feature_names = colnames(x[,2:ncol(x)]))
 
  return(importance_matrix)
}
 
 
#Same as above but w/ cross validation.
runXg.cv <- function(x){
  require(caret)
  require(xgboost)
 
  ind <- createDataPartition(x[,1], p = 2/3, list = FALSE )
 
  trainDf <- as.matrix(x[ind,])
  testDf <- as.matrix(x[-ind,])
 
  dtrain <- xgb.DMatrix(data = trainDf[,2:ncol(trainDf)], label = trainDf[,1])
  dtest <- xgb.DMatrix(data = testDf[,2:ncol(testDf)], label = testDf[,1])
 
  watchlist <- list(train=dtrain, test=dtest)
 
  bst <- xgb.train(data=dtrain, nthread = 2, nround=10, watchlist=watchlist, eval.metric = "rmse", verbose = 0)
 
  importance_matrix <- xgb.importance(model = bst, feature_names = colnames(x[,2:ncol(x)]))
 
  return(importance_matrix)
 
}
 
#This function returns a master table with different iterated values that finds the best number of predictor variables (x).
getTable <- function(x, importance_matrix, cor){
  # v.names <- c("feature", "SNPs", "strains", "cor_given", "cor_actual", "gain", "cover", "rank")
  v.names <- c("feature", "SNPs", "strains", "cor_given", "cor_actual", "p.value", "gain", "cover", "rank")
  for (i in 1:length(v.names)){
    assign(v.names[i], numeric())
  }
  featNum <- NULL
 
  n <- 0
  new_cor <- rep(cor, length(x)/length(cor))
  for (num in 1:length(x)) {
    featNum <- NULL
    imp_matrix <- importance_matrix[[num]]
    x.iter <- x[[num]]
    x.pv <- getPV(x[[num]])
 
    for (i in 1:2){
      featNum <- c(featNum, grep(paste("feature.",i,"$", sep = ""), imp_matrix$Feature, perl = TRUE))
    }
 
    n <-  n + 1
 
    for(i in 1:2){
      f <- featNum[i]
      feature<-c(feature, imp_matrix$Feature[f])
      SNPs <-c(SNPs, 100)
      strains <- c(strains, nrow(x.iter))
      cor_given <- c(cor_given, new_cor[n])
      cor_actual <-c(cor_actual, cor(x.iter[,1],x.iter[,grep(paste(imp_matrix$Feature[f], "$", sep = ""), colnames(x.iter), perl = TRUE)]))
      p.value <- c(p.value, x.pv[,imp_matrix$Feature[f]])
      gain <- c(gain, imp_matrix$Gain[f])
      cover <- c(cover, imp_matrix$Cover[f])
      rank <- c(rank, f)
    }
  }
 
  x.master <- data.frame(feature, SNPs, strains, cor_given, cor_actual, p.value, gain, cover, rank)
 
  return(x.master)
}
</syntaxhighlight>
</syntaxhighlight>
* Export a PDF or image
 
* Open a new R script, name it as "hg.R"
==Summer Project 2. Whole-genome variants associated with multidrug resistance in clinical Pseudomonas & E.coli isolates (with Dr Yi-Wei Tang of MSKCC)==
* Select commands and save to script
# Acknowledgement: Hunter CTBR Pilot Award
* Retrieve and edit a command by pressing "up" or "down" arrows
# Stage 1: Select patients & strains for whole-genome sequencing by MiSeq, based on drug sensitivities (16 isolates from 5 patients were isolated, tested, and selected by April, 2017)
* Retrieve commands by using the search box on the "History" table
# Stage 2: Genome sequencing (FASTQ files generated, 3 replicates for each isolates; done by June 2017)
* Type q() to quit. Answer "y" to save workspace
# Stage 3: Variant call
* To reload and restore workspace, go to "C:/Users/instructor/Documents/human.genes" and double click on the file "human.gene"
## Reference strains identified using Kraken (Roy)
## VCF generation using cortex_var (Michele and Hanna, led by John)
# Stage 4: Variant annotation
# Stage 5: Statistical analysis
# Stage 6: Web report
 
==Summer Project 3. Origin of Genetic Code==
Tools for testing SGC & evolved codes
* Shuffle code:
<syntaxhighlight lang='bash'>
./shuffle-code.pl
    -f <sgc|code-file> (required)
    -s <1|2|3|4|5> (shuffle by 1st, 2nd, 3rd, aa blocks, and all random)
    -p(olarity; default)
    -h(ydropathy)
    -v(olume)
    -e(iso-electricity)
# Input: 'sgc' (built-in) or an evolved code file consisting of 64 rows of "codon"-"position"
# Output: a code file consisting of 64 rows of "codon" - "aa" (to be fed into ./code-stats.pl)
# Usage examples:
./shuffle-code.pl -f 'sgc' -s 5 # randomly permute SGC
./shuffle-code.pl -f 'evolved-code.txt' -p # evolved code, AA assigned according to polarity (default)
./shuffle-code.pl -f 'evolved-code.txt' -s 5 -h # evolved code, AA assigned according to hydrophobicity, shuffled randomly
</syntaxhighlight>
* Code statistics
<syntaxhighlight lang='bash'>
./code-stats.pl
        [-h]    (help)
        [-f 'sgc (default)|code_file']  (code)
        [-s 'pair (default)|fit|path']  (stats)
        [-p 'pol|hydro|vol|iso', default 'grantham']    (aa prop)
        [-i (ti/tv, default 5)]
        [-b: begin codon ('TTT') -q: panelty for 1st (50); -w: panelty for 2nd (100)]  (options for path)
# Input: 'sgc' (built-in) or a code file consisting of 64 rows of "codon" - "aa"
# Output: codon or code statistics
# Usage examples:
./code-stats.pl # all defaults: print single-mutation codon pairs, grantham distance, for SGC
./code-stats.pl -s 'fit' -p 'pol' # print code fitness according to polarity, for SGC (default)
./code-stats.pl -s 'path' # print tour length for SGC
</syntaxhighlight>
 
<gallery perrow=4>
Sgc-polar.png|SGC: ordered by polarity
Sgc-hydro.png|SGC: ordered by hydropathy
Sgc-vol.png|SGC: ordered by volume
Sgc-iso.png|SGC: ordered by isoelectricity
 
Path-code3.png|An evolved code (by Oliver)
code-john-2.png|An evolved code (by John). Round-trip (last position connected with the first) AA assigned according to polarity gradient & with "TAA" as the 1st.
code-brian-1.png|An evolved code (by Brian). One-way tour (last position NOT connected with the first). AA assigned according to polarity gradient & with "TAA" as the 1st.
code-error-v3.png|code robustness: <font color="red">SGC</font>, <font color="blue">John's simulated code </font>, <font color="orange">Brian's simulated code </font>, histogram: 1000 random codes. X-axis: mean-squared error (standardized) caused by single-nt substitutions. Evolved codes perform better than SGC!
 
Aa.pca.png|PC1 (mostly polarity and hydropathy, anti-correlated) & PC2 (the other two, positively correlated) of 4 AA metrics. PC1 and PC2 could be used for AA ranking as composite variables
tRNA-tree.png|A tRNA gene tree from Aeropyrum pernix (an Archaea). Sequences from [http://trna.bioinf.uni-leipzig.de/DataOutput/ an rRNA database]. Structual alignment available
tRNA-tree-2.png|tRNA gene tree for Pyrococcus horikoshii (another Archaea). Sequences from [http://gtrnadb.ucsc.edu/GtRNAdb2/index.html UCSD tRNA database]. (Not structurally aligned; only fasta seqs available)
tRNA-seq-dist.png|AA distance appears to be correlated with tRNA seq differences (especially at low distances)
</gallery>
# Participants: Oliver, John, Brian
# A representation mimicking TSP (see figure at right)
# Code to calculate total (mutational, ti/tv) path of SGC?
# Code to calculate energy of SGC?
# Randomize amino acid assignment and obtain distributions of path & energy
# Code to minimize total path length and energy?
# How to evolve a robust SGC? mutation bias + polarity/hydropathy + usage
 
=Season I. Spring 2017 (Themes: Dueling Idiots/Digital Dice/Wright Fisher Process)=
=="Coalescence" (Backward simulation of Wright-Fisher process) (Due May 12, 2017)==
[[File:Coalescent-tree.png|thumbnail|([http://raven.iab.alaska.edu/~ntakebay/teaching/programming/coalsim/node1.html Source])]]
[[File:Coalescent-output-1.png|thumbnail]]
We will conclude Spring 2017 season with the coalescence simulation (in summer, we will start experimenting with simulation of systems evolution)
 
Previously, we simulated Wright-Fisher process of genetic drift (constant pop size, no selection) starting from a founder population and end with the present population. This is not efficient because the program has to track each individual in each generation, although the majority of them do not contribute to the present population.
 
Coalescence simulation takes the opposite approach of starting from the present sample of k individuals and trace backward in time to their most recent common ancestor (MRCA). Due to the nature of Poisson process, the waiting time from one coalescence event (at time T) to the next one (at time T-1) is exponentially distributed with a mean of (k choose 2) generations. This process is iterated until the last coalescent event.
 
The classic text on coalescence is [http://home.uchicago.edu/rhudson1/popgen356/OxfordSurveysEvolBiol7_1-44.pdf Richard Hudson's chapter], which includes  (in Appendix) C codes for simulating tree and mutations. Hudson is also the author of <code>ms</code>, the widely used coalescence simulator.
 
Like the <code>ms 10 1 -T</code> command, your code should output a tree of 10 individuals. [A lot tougher than I thought; can't figure it out in R; so I did it in Perl]. Similarly, you could use R, where the <code>ape</code> package has a function called <code>rcoal()</code> that generate a random coalescence tree. Try this command: <code>plot(rcoal(10))</code> (first load library by running <code>library(ape)</code>).
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #1. Due 2/9, Thursday (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|
# (2 pts) Install R & R Studio on your own computer:
*By Weigang (First draft)
## First, download & install R from this website: https://mirrors.nics.utk.edu/cran/
<syntaxhighlight lang="perl">
## Next, download & install R studio from this website: https://www.rstudio.com/
#!/usr/bin/env perl
# (4 pts) Reproduce the "human.gene" project (follow steps in Demo). Save and print the histogram for gene length & the barplot for number of genes on each chromosome.  Label x and y axes. Show all commands. [Hint: combine the commands and figures into a single WORD document]
# Basic coalescence
# (4 pts) Vector operations.
# First, simulate coalescent events with recursion
## Create a new vector (named as "gene.length") of gene length
# Second, use Bio::Tree to export a newick tree
## Show commands for extracting the first item, first 10 items, items 20 through 30, the 1st, 2nd, and 5th items
use strict;
## Apply the following functions: range(), min(), max(), mean(), var(). [Hint: use help(var), help(min) for help]
use warnings;
use Data::Dumper;
use Algorithm::Numerical::Sample  qw /sample/;
use Math::Random qw(random_exponential);
use Bio::Tree::Tree;
use Bio::Tree::Node;
 
######################
# Initialize
######################
die "Usage: $0 <num-of-samples>\n" unless @ARGV == 1;
my $nsamp = shift @ARGV;
my @samples;
for (my $i=1; $i<=$nsamp; $i++) { push @samples, {id=>$i, parent=>undef, br=>0} }
my $ctr = $nsamp;
my $time = 0;
my @all_nodes;
 
################################################################################
# Simulate events with exponential time intervals between two successive events
#################################################################################
&coal(\@samples, \$ctr, \$time);
#print Dumper(\@all_nodes);
#########################################
# Reconstitute into tree using Bio::Tree
##########################################
my %seen_node;
my $max_id=0;
my @nodes;
foreach (@all_nodes) {
    $max_id = ($_->{parent} > $max_id) ? $_->{parent} : $max_id;
    if ($seen_node{$_->{id}}) { # child exist, previously as a parent, no branch length
        $seen_node{$_->{id}}->branch_length(sprintf "%.6f", $_->{br}); # add branch length
        if ($seen_node{$_->{parent}}) { # parent exist
            $seen_node{$_->{parent}}->add_Descendent($seen_node{$_->{id}});
        } else { # parent new
            my $pa = Bio::Tree::Node->new(-id=>$_->{parent});
            $pa->add_Descendent($seen_node{$_->{id}});
            $seen_node{$_->{parent}} = $pa;
        }
    } else { # child new
        my $new = Bio::Tree::Node->new(-id=>$_->{id}, -branch_length => sprintf "%.6f", $_->{br});
        $seen_node{$_->{id}} = $new;
        if ($seen_node{$_->{parent}}) { # parent exist
            $seen_node{$_->{parent}}->add_Descendent($new);
        } else { # parent new
            my $pa = Bio::Tree::Node->new(-id=>$_->{parent});
            $pa->add_Descendent($new);
            $seen_node{$_->{parent}} = $pa;
        }
    }
}
 
#my $root = $seen_node{$max_id};
my $tree=Bio::Tree::Tree->new(-id=>'coal-sim', -node=>$seen_node{$max_id}, -nodelete=>1);
print $tree->as_text("newick"), "\n";
 
exit;
 
sub coal {
    my $ref_nodes = shift;
    my $ref_ct = shift;
    my $ref_time = shift;
    my $ct = $$ref_ct;
    my @current_nodes = @$ref_nodes;
    my $k = scalar @current_nodes;
    my @new_nodes;
    return unless $k > 1;
    my @pair = sample(-set => $ref_nodes, -sample_size => 2);
#   print $ct, "\t", $$ref_time, "\t", $pair[0]->{id}, "\t", $pair[1]->{id}, "\n";
    $$ref_time += random_exponential(1, 2/$k/($k-1));
    my $new_nd = {id=>$ct+1, parent=>undef, br=>$$ref_time};
    map {$_->{parent} = $ct+1} @pair;
    map {$_->{br} = $$ref_time - $_->{br}} @pair;
    push @all_nodes, $_ for @pair;
    foreach (@current_nodes) {
        push @new_nodes, $_ unless $_->{id} == $pair[0]->{id} || $_->{id} == $pair[1]->{id};
    }
    push @new_nodes, $new_nd;
    $$ref_ct++;
    &coal(\@new_nodes, $ref_ct, $ref_time);
}
</syntaxhighlight>
*By John
<syntaxhighlight lang="python">
from Bio import Phylo
from io import StringIO
from random import sample
from scipy.misc import comb
from itertools import combinations
from numpy.random import exponential as exp
 
# Generate Random Tree with Nodes & Waiting Time
individuals = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
node_dict = {}
reference = {}
wait_time_ref = {}
node = 0
while len(individuals) != 1:
    node_dict[node] = {"child_nodes": []}
    sample_events = tuple(sample(individuals, 2))
    reference[sample_events] = node
    wait_time = exp(1/comb(len(individuals), 2))
    wait_time_ref[node] = wait_time
    for sample_event in sample_events:
        if type(sample_event) == str:
            node_dict[node]["child_nodes"].append(sample_event)
        else:
            child_node = reference[sample_event]
            node_dict[node]["child_nodes"].append(child_node)
    individuals.remove(sample_events[0])
    individuals.remove(sample_events[1])
    individuals.append(sample_events)
    node += 1
   
# Calculate Tree Branch Lengths
tree_dict = {}
cumulative_br_length = 0
cumulative_br_len_dict = {}
for i in range(len(node_dict)):
    tree_dict[i] = {}
    cumulative_br_length += wait_time_ref[i]
    cumulative_br_len_dict[i] = cumulative_br_length
    for node in node_dict[i]['child_nodes']:
        if type(node) == str:
            tree_dict[i][node] = cumulative_br_length
        else:
            tree_dict[i][node] = cumulative_br_len_dict[i] - cumulative_br_len_dict[node]
           
# Parse the Tree into a String
for i in range(len(tree_dict)):
    for node in tree_dict[i].keys():
        if type(node) != str:
            temp = str(tree_dict[node])
            tree_dict[i][temp] = tree_dict[i].pop(node)
tree_str = str(tree_dict[i]).replace('\'', '').replace('\"', '').replace('\\', '').replace('{', '(').replace('}', ')').replace(' ', '')
tree_str += ";"           
 
# Visualize Tree
handle = StringIO(tree_str)
tree = Phylo.read(handle, 'newick')
tree.ladderize()   # Flip branches so deeper clades are displayed at top
Phylo.draw(tree)
</syntaxhighlight>
|}
|}


===Feb 9. (<font color="red">Class cancelled due to snow storm</font>)===
==April 17, 2017 "Mutation Meltdown" (Due May 5, 2017)==
[[File:Ratchet-1.png|thumbnail]]
Our last exploration showed that population will reach a steady-state level of DNA sequence polymorphism under the opposing forces of genetic drift and mutations. The steady-state level is expected to be ~ N * mu: the larger the population size and the higher the mutation rate, the higher per-site DNA polymorphism. (Also note that, given a long enough sequence and a low enough mutation rate, we don't expect to see two or more mutations hitting a single position. Each mutation is essentially a new one and only two-state SNPs are expected in a sample of DNA sequences).
 
However, the steady-state expectation is based on the assumption that all mutations are neutral (i.e., no beneficial or harmful fitness effects). In reality, mutations are predominantly either neutral or harmful and few are beneficial. Natural selection (negative or positive, except those on the immune-defense loci) tends to drive down genetic variation.
 
Sex to the rescue. Without sex or recombination, the fate of genetic variations across a genome are bundled together, rising or sinking in unison with the fate of a single beneficial or harmful mutation. With recombination, genetic variations at different loci become less tightly linked and are more likely maintained, speeding up adaptation.
 
This week, we will use simulation to recreate the so-called "[https://en.wikipedia.org/wiki/Muller%27s_ratchet Muller's Ratchet]", which predicts that asexual populations are evolutionary dead ends. It shows that the combined effect of deleterious mutation and genetic drift will lead to population extinction due to steady accumulation of deleterious mutations, a phenomenon called "mutation meltdown".
 
The simulation protocol would be similar to that of the last problem involving only mutation and drift, but I suggest you rewrite from scratch to be more efficient by using a simpler data structure (no sequence or bases needed). The main difference is that, instead of calculating average pairwise sequence differences in each generation, you will track the frequency of mutation-free sequences (n0) and find the time until it goes to zero. That is when the Ratchet makes a "click". The next click is when the population losses the one-mutation sequences (n1), and so on. Each click drives the population fitness down by one mutation-level. [ Numerically, if each mutation causes a fitness loss of <code>s</code> ("selection coefficient"), the fitness of a sequence with <code>k</code> mutations is given by w = (1-s)<sup>k</sup>. Strictly speaking, the selection coefficient should be included to affect gamete size, but let's ignore that for now].
 
If there is recombination, the mutation-free sequences could be recovered (simulation next time?). Without recombination, the only direction for the population to evolve is a steady loss of best-fit individuals (by genetic drift).
 
Questions:
# Does the ratchet occur faster in small or large populations? Find by simulating N=100 and N=1000
# Does the ratchet occur faster for a short or long genome? Find by simulating L=1e3 and L=1e4
# Which parts of the human genomes are asexual (therefore subject to Muller's Ratchet and becoming increasingly small)?
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|
*By John
<syntaxhighlight lang="python">
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import poisson
from numpy.random import choice
 
def simulator(seq_length, pop, repro_rate, generations):
    mutation_rate = 0.00001
    lam = seq_length * mutation_rate
    individuals = np.array([1 for i in range(pop)])
    non_mutated_pop_rate = []
    for generation in range(generations):
        gametes = []
        for individual in individuals:
            off_sprs = poisson(lam, repro_rate)
            mutation_ix = np.array(np.where(off_sprs != 0))
            indiv_gametes = np.array([individual for j in range(repro_rate)])
            indiv_gametes[mutation_ix] = 0
            gametes += list(indiv_gametes)
        individuals = choice(gametes, size=pop)
        non_mutated_frequency = float(np.count_nonzero(individuals)) / pop
        non_mutated_pop_rate.append(non_mutated_frequency)
    return non_mutated_pop_rate
 
genome_length_1000 = 1000
genome_length_10000 = 10000
results_1000 = simulator(genome_length_1000, 1000, 100, 500)
results_10000 = simulator(genome_length_10000, 1000, 100, 500)
 
genome_length_1000 = 1000
genome_length_10000 = 10000
results2_1000 = simulator(genome_length_1000, 10000, 100, 500)
results2_10000 = simulator(genome_length_10000, 10000, 100, 500)
 
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
line_1_1000 = ax1.plot(results_1000, "r")
line_1_10000 = ax1.plot(results_10000, "b")
line_2_1000 = ax2.plot(results2_1000, "r")
line_2_10000 = ax2.plot(results2_10000, "b")
 
ax1.set_title("Population = 1000")
ax2.set_title("Population = 10000")
 
plt.title("Mutation Meltdown")
plt.show()


===Feb 16. R Data Structure & Variable Types===
* Vector
<syntaxhighlight lang=R">
x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponent function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = -1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector of same element
x <- rep(NA, times = 10) # pre-define a vector with unknown elements; Use no quotes
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements with indices
x[1]
x[1:3]
x[-2]
x[c(1,3)]
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
</syntaxhighlight>
</syntaxhighlight>
* Matrix
 
*By Weigang
<syntaxhighlight lang=R">
<syntaxhighlight lang=R">
BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index
add.mutation <- function(genome, genome.length, mutation.rate) {
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
  mu.exp <- genome.length * mutation.rate;
data.1 <- cbind(age, BMI, bp) # create a matrix using column bind function cbind(), individuals in rows
  mu.num <- rpois(1, lambda = mu.exp);
data.1
  return(genome+mu.num);
data.2 <- rbind(age, BMI, bp) # create a matrix using row bind function rbind()
}
t(data.1) # transpose a matrix: columns to rows & rows to columns
 
dim(data.1) # dimension of the matrix
ratchet <- function(pop.size=100, gamete.size=100, mu=1e-4, genome.length=1e4) {
colnames(data.1)
  out <- data.frame(generation=numeric(), n0=numeric(), n1=numeric(), pop=numeric(), length=numeric());
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
  pop <- rep(0, pop.size) # initial all mutation-free
data.1
  g <- 1; # generation
data.1[3,1] # access the element in row 3, column 1
  freq0 <- 1; # freq of zero-class
data.1[2,] # access all elements in row 2
  while(freq0 > 0) {
data.1[,2] # access all elements in column 2
    cat("at generation", g, "\n");
matrix(data = 1:12, nrow = 3, ncol =4) # create a matrix with three rows and four columns; filled by column
    freq0 <- length(which(pop == 0));
matrix(data = 1:12, nrow = 3, ncol =4, byrow = TRUE) # filled by row
    out <- rbind(out, data.frame(generation=g, n0=freq0/pop.size, pop=pop.size, length=genome.length));
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create an empty matrix
    gametes <- sapply(1:gamete.size, function(x) {lapply(pop, function(x) add.mutation(x, genome.length, mu))});
mat[1,3] <- 5 # assign a value to a matrix element
    pop <- sample(gametes, pop.size);
</syntaxhighlight>
    g <- g+1;
* Dataframe
  }
<syntaxhighlight lang=R">
  return(out);
class(hg) # show object class
}
hg[1,] # how first row
 
hg[,1] # show first column
out.long.genome.df <- ratchet(genome.length=1e4);
hg[1:3,] # show rows 1 through 3
out.short.genome.df <- ratchet(genome.length=1e3);
hg[,1:3] # show columns 1 through 3
hg$Gene.Name # show column "Gene.Name"
</syntaxhighlight>
</syntaxhighlight>
|}
==March 31, 2017 "Drift-Mutation Balance" (4/14/2017)==
[[File:Drift-mutation.png|thumbnail]]
We will explore genetic drift of a DNA fragment with mutation under Wright-Fisher model. From last week's exercise, we conclude that a population will sooner or later lose genetic diversity (becoming fixed after ~2N generations), if no new alleles are generated (by e.g., mutation or migration).
Mutation, in contrast, increases genetic diversity over time. Under neutrality (no natural selection against or for any mutation, e.g., on an intron sequence), the population will reach an equilibrium point when the loss of genetic diversity by drift is cancelled out by increase of genetic diversity by mutation.
You job is to find this equilibrium point by simulation, given a population size (N) and a mutation rate (mu). The expected answer is pi=2N*mu, where pi is a measure of genetic diversity using DNA sequences, which is the average pairwise sequence differences within a population.
<font color="blue">Note: the following algorithm seems to be too complex to build at once. Also, too slow to run in R. Nonetheless, please try to write two R functions: (1)mutate.seq(seq, mut), with "seq" as a vector of bases and "mut" is the mutation rate, and (2) avg.seq.diff(pop), with "pop" as a list of DNA sequences</font>
A suggested algorithm is:
# Start with a homogeneous population with N=100 identical DNA sequences (e.g., with a length of L=1e4 bases, or about 5 genes) [R hint: <code>dna <- sample(c("a", "t", "c", "g"), size=1e5, replace=T, prob = rep(0.25,4))</code>]
# Write a mutation function, which will mutate the DNA based on Poisson process (since mutation is a rare event). For example, if mu=1e-4 per generation per base per individual (too high for real, but faster for results to converge), then each generation the expected number of mutations would be L * mu = 1 per individual per generation for our DNA segment. You would then simulate the random number of mutations by using the R function <code>num.mutations <- rpois(1,lambda=1)</code>.
# Apply the mutation function for each individual DNA copy (a total of N=100) during gamete production (100 gametes for each individual) at each generation (for a total of G=1000 generations).
# Write another function to calculate, for each generation, instead of counting allele frequencies (as last week's problem), to calculate & output average pairwise differences among the 100 individuals.
# Finally, you would graph pi over generation.
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #2. Due 2/23 (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
# (2 pts) Construct a numeric vector of 10 random numbers from the uniform distribution between 0 and 1 (Hint: use the function <code>runif()</code>). Name the resulting vector as "ran.1". Show length, range, mean, and variance.
* By John
# (2 pts) Construct a numeric vector of 10 random numbers from a normal distribution with mean of 0 and variance of 1 (Hint: use the function <code>rnorm()</code>). Name the resulting vector as "ran.2". Show length, range, mean, and variance. Compare the variance with the previous one: which is large?
<syntaxhighlight lang="python">
# (2 pts) Construct a matrix of 10 rows by combining the previous two vectors using the <code>cbind</code> function. Name the matrix as "mat". Assign row names as "ind1" .. "ind10". Show row values for ind1, column values for ran.2; transpose the matrix and save it as "mat.t".
import numpy as np
# (2 pts) Construct a character vector of 10 US States (Hint: use the c() function). Name it "us.states". Use full, case-sensitive names and "_" in place of spaces. Show the first and the fifth states with one command.
import matplotlib.pyplot as plt
# (2 pts) Understand variable types. (From Whitlock & Schluter) Researchers randomly assign diabetes patients to two groups. In the first group, the patients receive a new drug while the other group received standard treatment without the new drug. The researchers compared the rate of insulin release in the two groups.
from itertools import combinations
## List the two variables and state whether each is categorical (if so, whether it is nominal or ordinal) or numerical (if so, whether it is discrete or continuous)
from random import sample
## State & explain which variable is the explanatory (i.e., predictive) and which is the response variable.
from numpy.random import choice
|}
from numpy.random import poisson
 
# global variables
nuc = np.array(["A", "T", "C", "G"])
expected_mutation_rate = 0.00001 # there are 2 sequences per individual
seq_length = 10000
individual_pop = 100
gamete_rate = 100
generation = 1000
 
def simulator(population, repro_rate, generation):
    observed_mutation_collection = []
   
    # Original Individuals
    original = choice(nuc, seq_length, 0.25)
    individual_total = [original for i in range(population)]
   
    for i in range(generation): # iterate over 1000 generations
        # Produce Gametes
        gametes = []
        for individual in individual_total:
            gametes += [individual for i in range(repro_rate)]
        gametes = np.array(gametes)
        gametes_pop = gametes.shape[0] # number of gametes: 100 * 100
       
        # Mutation
        mutation_number_arr = poisson(lam=seq_length * expected_mutation_rate, size=gametes_pop) # derive number of mutations per individual
        mutation_index_arr = [sample(range(seq_length), mutations) for mutations in mutation_number_arr] # get the index of mutation
        # Mutation: Replace with Mututated Base Pairs
        for i in range(gametes_pop): # iterate over all gametes
            for ix in mutation_index_arr[i]: # iterate over mutated base pair for each gamete
                gametes[i, ix] = choice(nuc[nuc != gametes[i, ix]], 1)[0] # locate mutation and alter the nuc with mutated one
       
        # Next generation of individuals
        individual_total = gametes[choice(range(10000), 100, replace=False)]
       
        # Calculate observed mutation rate
        num_combinations = 0
        total_diff_bases = 0
        for pair in combinations(range(len(individual_total)), 2): # aggregate mutations for all possible pairs of individuals
            total_diff_bases += len(np.where((individual_total[pair[0]] == individual_total[pair[1]]) == False)[0])
            num_combinations += 1
        observed_mutation_rate = float(total_diff_bases) / float(num_combinations) / seq_length
        observed_mutation_collection.append(observed_mutation_rate)
       
    return(observed_mutation_collection)
 
# Simulation
result = simulator(100, 100, 1000)
 
# Visualize results
plt.plot(result, 'b')
plt.xlabel("Generation")
plt.ylabel("Mutation Rate")
plt.title("Drift-Mutation Balance")
plt.show()
</syntaxhighlight>


===Feb 23. Data Visualization===
*By Weigang
* Vector functions returning indices
<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
# The which() function returns the indices of TRUE elements
library(Biostrings) # a more compressed way to store and manipulate sequences
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # add a length column
bases <- DNA_ALPHABET[1:4];
hg.long.idx <- which(hg$Gene.Length > 1e6) # returns indices
dna <- sample(bases, size = 1e5, replace = T);
hg.long <- hg[hg.long.idx,] # genes longer than 1 million bases
library(ape) # to use the DNAbin methods
hg.mt.idx <- which(hg$Chromosome == "MT")
# function to mutate (Poisson process)
hg.mt <- hg[hg.mt.idx,] # mitochondrial genes
mutate.seq <- function(seq, mutation.rate) {
# The grep() function returns the indices of matching a pattern
  mu.exp <- length(seq) * mutation.rate;
p53.idx <- grep("P53", hg$Gene.Name)
  mu.num <- rpois(1, lambda = mu.exp);
hg.p53 <- hg[p53.idx,]
  if (mu.num > 0) {
# The order() function returns the indices of sorted elements
    pos <- sample(1:length(seq), size = mu.num);
idx.sorted <- order(hg$Gene.Length)
    for (j in 1:length(pos)) {
hg.sorted <- hg[idx.sorted,]
      current.base <- seq[pos[j]];
      mutated.base <-sample(bases[bases !=current.base)], size = 1);
      seq[pos[j]] <- mutated.base;
    }
  }
  return(seq)
}
 
# main routine
drift.mutation <- function(pop.size=100, generation.time=500, gamete.size=10, mu=1e-5) {
  out <- data.frame(generation=numeric(), pi=numeric()); # for storing outputs
  pop <- lapply(1:pop.size, function(x) dna)  # create the initial population
  for(i in 1:generation.time) { # for each generation
    cat("at generation", i, "\n"); # print progress
    gametes <- sapply(1:gamete.size, function(x) {lapply(pop, function(x) mutate.seq(x, mu))});
    pop <- sample(gametes, pop.size);
    pop.bin <- as.DNAbin(pop); # change into DNAbin class to advantage of its dist.dna() function
    out <- rbind(out, data.frame(generation=i, pi=mean(dist.dna(pop.bin))));
  }
  out;
}
out.df <- drift.mutation(); # run function and save results into a data frame
 
</syntaxhighlight>
</syntaxhighlight>
* Textbook: Chapter 1. Statistics and Sample; Chapter 2. Displaying Data
|}
* Lecture slides: [[File:Lecture-slides-part-1.pdf|thumbnail]]
 
==March 18, 2017 "Genetic Drift" (Due 3/31/2017)==
[[File:Drift.png|thumbnail]]
[[File:Drift-founder.png|thumbnail]]
This is our first biological simulation. Mandatory assignment for all lab members (from interns to doctoral students). An expected result is shown in the graph.
 
Task: Simulate the Wright-Fisher model of genetic drift as follows:
# Begin with an allele frequency p=0.5, and a pop of N=100 haploid individuals [Hint: <code>pop <- c(rep("A", 50), rep("a", 50))</code>]
# Each individual produces 100 gametes, giving a total of 10,000 gametes [Hint: use a for loop with rep() function]
# Sample from the gamete pool another 100 to give rise to a new generation of individuals [Hint: use sample() function]
# Calculate allele frequency [Hint: use table() function]
# Repeat the above in succession for a total generation of g=1000 generations [Hint: create a function with three arguments, e.g., wright.fisher(pop.size, gamete.size, generation.time)]
# Plot allele frequency changes over generation time
# Be prepared to answer these questions:
## Why allele frequency fluctuate even without natural selection?
## What's the final fate of population, one allele left, or two alleles coexist indefinitely?
## Which population can maintain genetic polymorphism (with two alleles) longer?
## Which population gets fixed (only one allele remains) quicker?
 
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #3. Due 3/2 (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|
# (1 pt) Load the iris data set by typing <code>data(iris)</code>
* By Lili (May, 2019)
# (1 pt) Identity a character variable and obtain frequency counts using the "table()" function
<syntaxhighlight lang="bash">
# (1 pt) Identity a numerical variable and obtain frequency distribution by a histogram. Use customized x-axis label
#Simple sampling, haploid, as a function
# (2 pts) Make a boxplot of distribution of each numerical variable with respect to species. For example, <code>boxplot(Sepal.Length ~ Species, data=iris)</code>. Label axes and title.
wright_fisher <- function(pop_size, gam_size, alle_frq, n_gen) {
# (2 pt) Make a strip chart of distribution of a numerical variable with respect to species using the <code>stripchart()</code> function. Customize it with axes labels, open circle symbol (pch = 1), the method of "jitter", and being vertical ("vertical=T").
  pop <- c(rep("A", pop_size*frq), rep("a", pop_size*frq))
# (1 pt) Make a scatter plot to show relations between two numerical variables. For example, <code>plot(iris$Sepal.Length, iris$Sepal.Width)</code>
  prob <- numeric(n_gen)
# (2 pts) Among graduate school applicants to a university department, 512 males were admitted, 313 males were rejected, 89 females were admitted, and 19 females were rejected. Explore if there is gender bias in admission by
  for(time in 1:n_gen){
## Identify the explanatory and response variables, as well as whether the variables are character or numerical
    gamt <- rep(pop, gam_size)
## Make a contingency table using the matrix() function. Add labels to columns and rows using colnames() and rownames() functions
    pop <- sample(gamt, pop_size)
## Plot the contingency table using grouped bar plot with the "barplot()" function, and the "beside = T" option.
    prob[time] <- table(pop)[1]/pop_size
## Plot the contingency table using the mosaicplot() function. Based on the plot, explain if there is evidence for gender bias. [Hint: try matrix transposition]
  }
|}
  wf_df <- data.frame(generation=1:n_gen, probability=prob)
  plot(drift_df, type= 'l', main = "Wright-Fisher Process (Genetic Drift)", xlab = "generation", ylab = "allele frequency")
}
 
wright_fisher(1000, 100, 0.5, 1000)
 
#Version 2. Using binomial distribution with replicates, diploid
 
pop_size <- c(50, 100, 1000, 5000)
alle_frq <- c(0.01, 0.1, 0.5, 0.8)
n_gen <- 100
n_reps <- 50
genetic_drift <- data.frame()
 
for(N in pop_size){
  for(p in alle_frq){
    p0 <- p
    for(j in 1:n_gen){
      X <- rbinom(n_reps, 2*N, p)
      p <- X/(2*N)
      rows <- data.frame(replicate= 1:n_reps, pop=rep(N, n_reps), gen=rep(j, n_reps), frq=rep(p0, n_reps), prob=p )
      genetic_drift <- rbind(genetic_drift, rows)
    }
  }
}
 
library(ggplot2)
ggplot(genetic_drift, aes(x=gen, y=prob, group=replicate)) + geom_path(alpha= .5) + facet_grid(pop ~ frq) + guides(colour=FALSE)
 
# 3rd version: trace ancestry (instead of allele frequency)
 
</syntaxhighlight>
 
*By John
<syntaxhighlight lang="python">
#Python
import numpy as np
import sys
from random import sample
import matplotlib.pyplot as plt
def simulator(gametes_rate, next_individuals, generations):
    individuals = [1 for i in range(50)] + [0 for j in range(50)]
    frequency = []
    for generation in range(generations):
        gametes = []
        for individual in individuals:
            gametes += [individual for i in range(gametes_rate)]
        individuals = sample(gametes, next_individuals)
        frequency.append(np.count_nonzero(individuals) / len(individuals))
    return(frequency)
N_100 = simulator(100, 100, 1000)
N_1000 = simulator(100, 1000, 1000)
# Create plots with pre-defined labels.
# Alternatively,pass labels explicitly when calling `legend`.
fig, ax = plt.subplots()
ax.plot(N_100, 'r', label='N=100')
ax.plot(N_1000, 'b', label='N=1000')
# Add x, y labels and title
plt.ylim(-0.1, 1.1)
plt.xlabel("Generation")
plt.ylabel("Frequency")
plt.title("Wright-Fisher Model")
# Now add the legend with some customizations.
legend = ax.legend(loc='upper right', shadow=True)
# The frame is matplotlib.patches.Rectangle instance surrounding the legend.
frame = legend.get_frame()
frame.set_facecolor('0.90')
# Set the fontsize
for label in legend.get_texts():
    label.set_fontsize('large')
for label in legend.get_lines():
    label.set_linewidth(1.5) # the legend line width
plt.show()
</syntaxhighlight>
<syntaxhighlight lang="scala">
#Scala
import scala.util.Random
import scala.collection.mutable.ListBuffer
val gamete_rate = 100
val offspr_rate = 100
val generations = 1000
var frequency: List[Double] = List()
var individuals = List.fill(50)(0) ++ List.fill(50)(1)
for(generation <- 1 to generations){
  var gametes = individuals.map(x => List.fill(gamete_rate)(x)).flatten
  individuals = Random.shuffle(gametes).take(offspr_rate)
  frequency = frequency :+ individuals.count(_ == 1).toDouble / offspr_rate
}
print(frequency)
</syntaxhighlight>
<syntaxhighlight lang="java">
#Spark
val gamete_rate = 100
val offspr_rate = 100
val generations = 300
var frequency: List[Double] = List()
var individuals = sc.parallelize(List.fill(50)("0") ++ List.fill(50)("1"))
for(generation <- 1 to generations){
  val gametes = individuals.flatMap(x => (x * gamete_rate).split("").tail)
  individuals = sc.parallelize(gametes.takeSample(false, offspr_rate))
  val count = individuals.countByValue
  frequency = frequency :+ count("1").toDouble / offspr_rate
}
print(frequency)
</syntaxhighlight>
 
*By Brian
<syntaxhighlight lang="bash">
Wright<-function(pop.size,gam.size,generation) {
if( pop.size%%2==0) {
  pop<-c(rep("A",pop.size*.5),rep("a",pop.size))
  frequency<-.5
  time<-0
  geneticDrift<-data.frame(frequency,time)
  for (i in 1:generation) {
    largePop<-rep(pop,gam.size)
    samplePop<-sample(largePop,100)
    cases<-table(samplePop)
    if (cases[1]<100 && cases[2]<100 )  
        {propA<-(cases[1]/100)
        geneticDrift<-rbind(geneticDrift,c(propA,i))
        pop<-c(rep("A",cases[2]),rep("a",cases[1]))
  } 
}
  plot(geneticDrift$frequency~geneticDrift$time,type="b", main="Genetic Drift N=1000", xlab="time",ylab="Proportion Pop A",col="red", pch=18 )
}
if(pop.size%%2==1 ) {
  print("Initial Population should be even. Try again")
}
}
Wright(1000,100,1000)
</syntaxhighlight>


===March 2. Exam 1===
*By Jamila
<syntaxhighlight lang="bash">
Genetic_code = function(t,R){
  N<- 100
  p<- 0.5
  frequency<-as.numeric();
  for (i in 1:t){
    A1=rbiom(1,2*N,p)
    p=A1/(N*2);
    frequency[length(frequency)+1]<-p;
  }
  plot(frequency, type="1",ylim=c(0,1),col=3,xlab="t",ylab=expression(p(A[1])))
 
  }
</syntaxhighlight>


===March 9. Describing data & Standard Error===
*By Sharon
* Textbook: Chapters 3 & 4
* Population and sample
<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
x <- rnorm(1000)
p=0.5
x.sample <- sample(x, size = 100)
N=100
n.genes <- nrow(hg) # number of rows
g=0
sampled.rows <- sample(1:n.genes, size = 100)
pop <- c(rep("A", 50), rep("a", 50)) #create a population of 2 alleles
hg.sample <- hg[sampled.rows,] # a random sample of 100 genes
data<-data.frame(g,p) #a set of variables of the same # of rows
for (i in 1:1000) { #generation
  gam_pl <-rep(pop, 100) #gamete pool
  gam_sam <-sample(gam_pl, 100) #sample from gamete pool
  tab_all<-table(gam_sam) #table ps sample
  all_freq<-tab_all[1]/100 #get allele frequency
  data<-rbind(data, c(i,all_freq))
  pop<-c(rep("A",tab_all[1]))
}
plot(data$g,data$p)
</syntaxhighlight>
</syntaxhighlight>
* Explore variable distributions
 
<syntaxhighlight lang=R">
*By Sipa
x <- rnorm(1000)
<syntaxhighlight lang="bash">
hist(x, breaks = 100) # distribution for continuous variable
pop <- c(rep("A", 50), rep("a", 50))
hist(hg$Gene.Length, br = 100)
wright_fisher <- function(N,gen_time,gam) {
hist(log10(hg$Gene.Length), br = 100)
  N=N
gene.cts <- table(hg$Chromosomes) # distribution for a categorical vector
  gen_time = gen_time
barplot(gene.cts)
  x = numeric(gen_time)
  x[1] = gam
  for (i in 2:1000) {
  k=(x[i-1])/N
  n=seq(0,N,1)
  prob=dbinom(n,N,k)
  x[i]=sample(0:N, 1, prob=prob)
 
}
 
plot(x[1:gen_time], type="l", pch=10)
 
}
pool <- rep(pop, times = 100)
s_pool <- sample(pool, size = 100, replace = F)
table(s_pool)
wright_fisher2 <- function(pop_size,al_freq,gen_time) {
  pop <- c(rep("A", pop_size/2), rep("a", pop_size/2))
  pool <-rep(pop, times=100)
  s_pool <- sample(pool, size = 100, replace = T)
  for(i in 1:gen_time)
    s_pool <- sample(sample(pool, size = pop_size, replace = F))
  a.f <- table(s_pool)[1]/100
  return(a.f)
  }
</syntaxhighlight>
</syntaxhighlight>
* In-Class exercise: A study of human gene length
 
*By Nicolette
<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
N=100
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
p=0.5
mean(hg.len) # not representative, super-long genes carry too much weight to the average length
g=1000
median(hg.len) # More representative. Use median for a variable not normally distributed
sim=100
summary(hg.len) # Show all quartiles
Genetic_drift=array(0, dim=c(g,sim))
IQR(hg.len) # 3rd Quartile - 1st Quartile, the range of majority data points, even for skewed distribution
Genetic_drift[1,]=rep(N*p,sim)
log.len <- log10(hg.len); hist(log.len, br=200) # Log of gene length is more normally distributed
for(i in 1:sim) {
mean(log.len); median(log.len) # They should be similar, since log.len is normal
  for(j in 2:g){
# The next block is intend to show that the "mean length" of samples is normally distributed, although the length itself is not
    X[j,i]=rbinom(1,N,prob=X[j-1,i]/N)
samp.len <- sample(hg.len, 100) # take a random sample of 100 length
  }
mean(samp.len) # a sample mean
}
# Repeat the above 1000 times, so we could study the distribution of "mean length" (not "length" itself)
Genetic_drift=data.frame(X/N)
mean.sample.100 <- sapply(1:1000, function(x) mean(sample(hg$Gene.Length, size = 100)))
matplot(1:1000, (X/N), type="l",ylab="allele_frequency",xlab="generations")
hist(mean.sample.100, br=100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
</syntaxhighlight>
</syntaxhighlight>
* By Weigang
<syntaxhighlight lang="bash">
# output frequency:
wright.fisher <- function(pop.size, generation.time=1000, gametes.per.ind=100) {
  out.df <- data.frame(gen=numeric(), freq=numeric(), pop.size=numeric());
  pop <- c(rep("A", pop.size/2), rep("a", pop.size/2)); # initial pop
  for(i in 1:generation.time) {
    gametes <- unlist(lapply(pop, function(x) rep(x, gametes.per.ind)));
    pop <- sample(gametes, pop.size); # sample gamete pool without replacement
    freq.a <- table(pop)[1]/pop.size; # frequency of "a"
    out.df <- rbind(out.df, c(gen=i, freq=freq.a, pop=pop.size, rep=rep));
  }
  out.list;
}
out.1e2<-wright.fisher(100)
out.1e3<-wright.fisher(1000)
# Make Figure 1 above
plot(out.1e2[,1], out.1e2[,2], type = "l", las=1, ylim=c(0,1), xlab="generation", ylab="allele frequency", main="Wright-Fisher Process (Genetic Drift)", col=3, lwd=2)
lines(out.1e3[,1], out.1e3[,2], type = "l", col=2, lwd=2)
legend(400, 0.2, c("N=100", "N=1,000"), lty=1, col=3:2, cex=0.75)
abline(h=0.5, col="gray", lty=2)
# Second function to track founders (not frequency)
wright.fisher.2 <- function(pop.size, generation.time=1000, gametes.per.ind=100) {
  pop <- 1:100 # label founders
  out.list <- list(pop);
  for(i in 1:generation.time) {
    gametes <- unlist(lapply(pop, function(x) rep(x, gametes.per.ind)));
    pop <- sample(gametes, pop.size); # sample gamete pool without replacement
    out.list[[i+1]] <- pop;
  }
  out.list;
}
out.1ist<-wright.fisher.2(100)
# plot loss of genetic diversity by tracking surviving founder lineages
plot(x=1:100, y=rep(1, 100), ylim=c(0,100), xlim=c(1,600), type="n", las=1, main="Drift: Suvival of founder lineages\n(N=100)", cex.main=0.75, xlab="generation", ylab="founders")
for (i in 1:600) { points(rep(i,100), 1:100, col=colors()[out.list[[i]]], cex=0.5) }
for (i in 1:600) {points(rep(i,100), out.list[[i]], cex=0.2, col="yellow") }
</syntaxhighlight>
|}
== March 11, 2017 "Stoplights" (Due 3/18/2017)==
[[File:Stoplight-1.png|framed|right]]
* Source: Paul Nahin (2008). "Digital Dice", Problem 18.
* Challenge: How many red lights, on average, will you have to wait for on your journey from a city block m streets and n avenues away from Belfer [with coordinates (1,1)]? (assuming equal probability for red and green lights)
* Note that one has only wait for green light when walking along either the north side of 69 Street or east side of 1st Avenue. On all other intersections, one can walk non-stop without waiting for green light by crossing in the other direction if a red light is on.
* Formulate your solution with the following steps:
# Start from (m+1,n+1) corner and end at (1,1) corner
# The average number of red lights for m=n=0 is zero
# Find the average number of red lights for m=n=1 by simulating the walk 100 times
# Increment m & n by 1 (but keep m=n), until m=n=1000
# Plot average number of red lights by m (or n).
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #4. Due 3/16. (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
# (2 pts) Repeat the above sampling experiment (sample size N =100). Save results to a vector called "mean.100" (which is a vector of means, with a length of 1000 elements). Show histogram. Show mean. Show standard deviation.
*By John
# (2 pts) Repeat the above sampling experiment (sample size N =500). Save results to a vector called "mean.500". Show histogram. Show mean. Show standard deviation
<syntaxhighlight lang="python">
# (2 pts) Repeat the above sampling experiment (sample size N =5000). Save results to a vector called "mean.5000". Show histogram. Show mean. Show standard deviation
import numpy as np
# (1 pt) Explain why mean is not a good description of a "typical" human gene length and which statistic is a better?
import matplotlib.pyplot as plt
# (0.5 pt) Sort the human genes by length (using the <code>order()</code>) & created a sorted data.frame ("hg.sorted") of the original data.frame ("hg").
import pandas as pd
# (0.5 pt) Obtain human genes longer than 1 million bases by using the <code>which()</code> function & create a new data.frame of long genes ("hg.long")  
from pandas import DataFrame
# (2 pts) Combine the three simulated-mean vectors into one (see below). Define "Standard error" & "confidence interval (CI)". Does CI quantify accuracy or precision?
from random import sample
<syntaxhighlight lang="bash">
from itertools import combinations, permutations
# Combine
from numpy import count_nonzero, array
sample.combined <- cbind(mean.100, mean.500, mean.5000)
colnames(sample.combined) <- c("samp.100", "samp.500", "samp.5000")
def red_light(av, st):
# plot in a single frame
    traffic_light = ["red_st", "red_av"]
par(mfrow=c(3,1))
    total_cross = av + st - 1
hist(sample.combined[,1], br=100, xlim=c(1e4, 2e5), main="sample size 20", xlab = "mean gene length")
    while av != 0 and st != 0:
hist(sample.combined[,2], br=100, xlim=c(1e4, 2e5), main = "sample size 100", xlab = "mean gene length")
        if sample(traffic_light, 1)[0] == "red_st":
hist(sample.combined[,3], br=100, xlim=c(1e4, 2e5), main = "sample size 500", xlab = "mean gene length")
            av -= 1
par(mfrow =c(1,1))
        else:
            st -= 1
    rest_cross = av if av != 0 else st
    return(count_nonzero([sample([0, 1], 1) for corss in range(rest_cross)]))
df = DataFrame(0, index=np.arange(10), columns=np.arange(10))
simulation = 1000
for av in df.index:
    for st in df.index:
        df.loc[av, st] = sum(array([[red_light(av, st)] for n in range(simulation)])) / simulation
       
plt.imshow(df, cmap='hot', interpolation='nearest')
plt.xticks(np.arange(1, 10))
plt.yticks(np.arange(1, 10))
plt.title("Average Numbers Waiting for Red Light")
plt.xlabel("Number of Av")
plt.ylabel("Number of St")
plt.show()
df
# Recursive Function for Each Trial
 
from sys import stdout
av = 30
st = 30
n_space = [0]
traffic_light = ["red_st", "red_av"]
def route(av, st):
    if not av == st == 0:
        if sample(traffic_light, 1)[0] == "red_st":
            if av == 0:
                stdout.write("w_")
                n_space[0] += 1
                route(av, st - 1)
            else:
                stdout.write("\n" + " " * n_space[0] + "|")
                route(av - 1, st)
        else:
            if st == 0:
                stdout.write("\n" + " " * n_space[0] + "w" + "\n" + " " * n_space[0] + "|")
                route(av - 1, st)
            else:
                stdout.write("_")
                n_space[0] += 1
                route(av, st - 1)
route(av, st)
</syntaxhighlight>
</syntaxhighlight>
|}


===March 16. Hypothesis Testing===
*By Weigang
* Chapter 6
<syntaxhighlight lang="perl">
* In-Class exercise 1. The following are measurements of body mass (in grams) of three species of finches in Africa, calculate mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
#!/usr/bin/env perl
#Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7
use strict;
#Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
use warnings;
#Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
 
# Use the 2SE rule of thumb, calculate 95% confidence interval.
# Use recursion
# Plot standard error & standard deviation
my ($distx, $disty) = @ARGV;
<syntaxhighlight lang=R">
my $num_red = 0; # keep red-light counts
df1 <- data.frame(bm = c(8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7), species = rep("sp1", 16))
print "-" x $distx, "\n"; # print a starting line
df2 <- data.frame(bm = c(16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16), species = rep("sp2", 12))
&walk($distx, $disty, \$num_red); # pass reference not value
df3 <- data.frame(bm = c(40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41), species = rep("sp3", 16))
exit;
df.bm <- rbind(df1, df2, df3)
 
boxplot(bm ~ species, data= df.bm)
sub walk {
sd(df1$bm)
    my ($x, $y, $ref) = @_;
mean.1 <-mean(df1$bm)
    my $ct = $$ref;
se.1 <- sd(df1$bm)/sqrt(nrow(df1))
    my $prob_red;
ci.1 <- c(mean(df1$bm) - 2 * se.1, mean(df1$bm) + 2*se.1)
    if ($x == 1 && $y == 1) { # Reached destination
bm.aov <- aov(bm ~ species, data = df.bm)
print "*\n";
print "-" x $distx, "\n"; # print the ending line
print "reached Belfer after waiting for ", $ct, " red lights\n";
    }
 
    if ($x == 1 && $y > 1) { # Reached right-side end
$prob_red = rand();
if ($prob_red < 0.5) { # red light, wait
    $ct++;
    print "w\n", " " x ($distx-1);
} else {
    print "|\n", " " x ($distx-1);
}
&walk($x, $y-1, \$ct);
    }
 
    if ($x > 1 && $y == 1) { # Reached bottom
$prob_red = rand();
if ($prob_red < 0.5) { # red light, wait
    $ct++;
    print "w";
} else { print "-"}
&walk($x-1, $y, \$ct);
    }
 
    if ($x > 1 && $y > 1) {
my $prob_across = rand(); # prob of walking right with green light
if ($prob_across >= 0.5) { # move one block right
    print "-";
    &walk($x-1, $y, \$ct);
} else { # red light, move one block down
    print "|\n", " " x ($distx-$x);
    &walk($x, $y-1, \$ct);
}
    }
}
</syntaxhighlight>
</syntaxhighlight>
* In-Class exercise 2. Hypothesis testing through simulation
 
<syntaxhighlight lang=R">
*By Jeff
# coin-flipping experiments
<syntaxhighlight lang="python">
runif(1) # take a random sample from 0-1, uniformly distributed
# In[26]:
rbinom(n = 1, size =100, prob = 0.5) # flipping 100 (size) fair (prob) coin, one (n=1) time
def walk (walker):
rbinom(n = 1000, size =100, prob = 0.5) # repeat above 1000 times
    import random
num.success <- rbinom(n = 1000, size =100, prob = 0.5) # save
   
barplot(table(num.success)) # distribution of number of successes
    def just_walk (x):                                  #[1]
length(which(num.success<=40))/1000 # probability of success less than or equal to 40
        if random.choice(["red_m_direction", "green_m"])=="green_m": x["loca"]["m"] -=1
# test if toads are right-handed: observation 14/18 are right-handed
        else: x["loca"]["n"] -=1
right.handed.toads.by.chance <- rbinom(n = 1000, size = 18, prob = 0.5) # null distribution, 1000 times
        return x   
barplot(table(right.handed.toads.by.chance)) # plot
    def may_have_to_wait (dist,waited):                #[2]
length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal or higher than 14
        if random.choice(["green","red"])=="red": waited += 1
# Binomial test
        else: dist -= 1
binom.test(x=14, size=18, p =0.5)
        return (dist,waited)   
 
    while (0 not in walker["loca"].values()):        #start walking
        walker = just_walk(walker)  
    while (walker["loca"]["m"] !=0):      # if n =0 and m != 0
        (walker["loca"]["m"], walker["waited"]) = may_have_to_wait(walker["loca"]["m"],walker["waited"])
    while (walker["loca"]["n"] !=0):      # if m =0 and n != 0
        (walker["loca"]["n"], walker["waited"]) = may_have_to_wait(walker["loca"]["n"],walker["waited"])  
 
    return walker["waited"]
 
def given_distance(m,n,rep):
    waited = list()
    for x in range(rep):
        walker={"loca":{"m":m,"n":n}, "waited":0}    #walker variable
        waited.append(walk(walker))     
    return sum(waited)/len(waited)
# main
result=list()
for d in range(1000):                                  # set walking distance here
        result.append(given_distance (d,d,100))     # set rep here
import matplotlib.pyplot as plt                    # ploting
plt.plot(result); plt.show()
 
#[1] no reason to wait for ANY red before one of the distances (m or n) is exhausted. Assuming when light is green for m, it must be red for n, vise versa.
#[2] when one of the directions (m or n = 0), waiting cannot be avoided, start counting.  
</syntaxhighlight>
</syntaxhighlight>
|}
== March 3, 2017 "PiE" (Due 3/10/2017)==
* Source: Paul Nahin (2008). "Dueling Idiots", Problem 5.
* Challenge: obtain numerical values of Pi and E by simulations
* Simulate pi by [[File:Pi-sim.png|thumbnail]]
# Randomly generate 10,000 pairs of uniformly distributed numbers from 0 and 1 (simulating throwing darts onto the unit square shown at right)
# Count the number of points enclosed within the quarter-circle
# Calculate the pi value from this proportion
* Simulate e by
# Generate N random numbers from 0 to 1
# Divide into N equal-width bins between 0 and 1
# Count the number of bins Z that receive none of the random numbers
# Obtain e ~ N/Z (based on binomial sampling formula)
# Simulate N=1e2, 1e3, and 1e4
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #5. Due 3/23 (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
# (1 pt) Make a single data frame containing the gene expression values for two genes (Hint: create two dataframes and name two columns as "expression" and "gene", then use <code>rbind()</code> to combine two dataframes into one)
* By Weigang
## "MED1": 12.38918, 9.084664, 9.48416, 9.363928, 8.194495, 8.694836, 8.771101, 9.998151, 12.66877, 8.684064, 8.944236, 11.8491, 8.40968, 8.990329, 9.782376, 8.58243, 12.00455, 8.580401, 9.161046, 9.047977, 8.672018, 8.811856, 8.354933, 8.763175
<syntaxhighlight lang="bash">
## "ZBTB42": 8.377784, 7.832712, 8.65289, 4.59474, 5.598869, 4.912963, 5.24125, 7.688584, 7.36693, 4.463853, 5.646581, 6.830076, 4.485883, 6.741698, 6.967342, 5.307032, 6.80991, 7.612475, 5.795508, 5.033554, 5.032286, 4.979937, 8.315718, 5.801263, 7.136532, 4.722164, 5.416593, 4.456056, 6.253954, 5.684245, 8.255962, 8.629676, 8.348159, 8.114049, 6.786746, 7.893434, 7.836647, 4.733391, 6.895385, 7.123281, 4.75207
# simulate pi:
# (1 pt) Make a stripchart of expression values separated by genes.
darts <- sapply(1:1000, function(x) { coords<- runif(2); return(ifelse(coords[1]^2+coords[2]^2 <= 1, 1,0)) })
# (1 pt) Obtain the mean, median, standard deviation, standard error, and 95% confidence intervals of expression for each gene
pi <- 4*mean(darts)
# (1 pt) Run two-sampled t test <code>t.test</code> to test if two genes differ significantly in expression levels. Specify what is the null hypothesis and what is the alternative hypothesis
 
# (2 pts) Identify whether each of the following statements is more appropriate as the null (H0) or alternative (H1) hypothesis. State appropriate H0 and H1 for each question.
# simulate e
## Most genetic mutations are deleterious
N <- 1e3;
## A diet has no effect on liver function
n <- runif(N);
# (4 pts) Can parents distinguish their own children by smell alone? In a study, eight of nine mothers identified their children correctly based on the smell of T-shirts children wore for three consecutive nights.
cts <- numeric();
## State the null & alternative hypotheses.
for(i in 1:(N-1)) {
## Simulate the null distribution using the "rbinom()" function (with n = 1 million). Plot the distribution using "barplot"
  left <- i/N;
## Verify your result with the "binom.test()" function.
  right <- (i+1)/N;
## Conclude by explain the p-value.
  cts[i] <- length(which(n>=left & n < right))
}
e <- N/length(which(cts==0))
</syntaxhighlight>
 
*By Nicolette
<syntaxhighlight lang="bash">
#Simulate pi
random <- 0; square <- 0;
for (i in 1:1000){
  random[[i]]<- runif(1,0,1)
square[[i]]<- sqrt(1-(random[i])^2)
}
plot(random, square)
areaofqc <- (pi/4)
ranleqc <- length(which(random <=areaofqc))
squleqc <- length(which(square<=areaofqc))
 
#Simulate e
e <- 0;
for(N in 1:1000) {
  numbers<-runif(N)
bin.size<-1/N
  non.empty<-as.integer(numbers/bin.size)
  z.empt<- N - length(table(non.empty))
  e<-c(e, N/z.empt)
  }
plot(e)
</syntaxhighlight>
 
*By Brian
<syntaxhighlight lang="bash">
#When you throw a dart at the a unit square dartboard the Probability of hitting the portion of the circle radius with center=(0,0) inside the unit square is  pi/4. We can say that hitting the 1/4 circle within the unit square with a dart is a bernoulli random variable where p=pi/4. Further, we can say E(bernoulli=hit)=pi/4.  
# Imagine you are scoring the game of darts as follows- 1 point if you throw in the 1/4 circle and 0 points if you miss this space.  
# If you throw 10000 darts you just did 10000 iid bernoulli trials. By the law of large numbers if we count the number of hits to the 1/4 circle of radius 1 and divide by number of darts thrown we will get something pretty close to E(beroulli) . Multiply that number by 4 and you have an estimate of Pi.
 
## Generate 10,000 pairs of points in the unit square with uniform distribution
 
x<-c(runif(1000000, min = 0, max = 1))
y<-c(runif(1000000, min = 0, max = 1))
point<-data.frame(x,y)
 
plot(point$x,point$y)
point.sub<-subset(point,y<=sqrt(1-x^2))
 
plot(point.sub$x,point.sub$y)
 
z<-4*nrow(point.sub)/1000000
z
error<-pi-z
error
 
## simulating exp
#It's a well that a binomial with large n and tiny p is a good estimate of the poisson random For this estimate lambda=np. In this case n=10,000 and p=the probability of falling into a particular unit=1/10000.......  
simexp<-function (n) {
a<-c(runif(n,min=0,max=1))
b<-c()
 
for (i in 1:n) {
  c<-subset(a,(i-1)/n<a & a<=i/n)
  d<-length(c)
  b<-append(b,d)
}
bzero<-subset(b,b==0)
length(bzero)
print(n/length(bzero))
}
simexp(1000)
simexp(10000)
simexp(100000)
 
</syntaxhighlight>
 
*By John
<syntaxhighlight lang="bash">
#python code
# Simulate pi
simulation = 10000
distances = np.array([pow(uniform(0, 1), 2) + pow(uniform(0, 1), 2) for i in range(simulation)])
pi = count_nonzero(distances < 1) / simulation * 4
pi
 
# Simulate e
total = 10000
ranges = [[x/total, (x+1)/total] for x in range(total)]
sample_space = [uniform(0, 1) for x in range(total)]
index = []
for i in range(total):
    if any(ranges[i][0] <= point <= ranges[i][1] for point in sample_space):
        index.append(i)
e = total / (total - len(set(index)))
e
</syntaxhighlight>
|}
|}


===March 23. One-sampled & Two-sampled t-test===
==Feb 24, 2017 "Idiots" (Due 3/3/2017)==
* Chapters 10-12
[[File:Idiots.png|thumbnail]]
* Lecture Slides: [[File:Lecture-slides-part-2.pdf|thumbnail]]
* Source: Paul Nahin (2000), "Dueling Idiots". Problem #2: "When Idiots Duel"
* Game: Idiots A and B decide to duel with a gun in the following way: They will insert a single bullet into the gun's cylinder. The cylinder has a total of 6 slots. Idiot A will spin the cylinder and shoot at B. If the gun doesn't fire, then A will give the gun to B, who will spin the cylinder and then shoot at A. This back-and-forth duel will continue until one fool shoots (and kills) the other.
* Questions: (1) what is the probability that A will win (and B dies); (2) What is the average number of trigger pulls before someone dies?
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #6. Due 3/30 (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
* (5 pts) Ostriches live in hot environments. To test if ostriches could reduce brain temperature relative to body temperature, the mean body and brain temperature (in Celsius) of six ostriches was recorded in the following table.
* By Roy
# State the null and alternative hypotheses
<syntaxhighlight lang="bash">
# Make a data frame and visualize with a boxplot combined with a stripchart
gun <-c(1,0,0,0,0,0)
# Test if there is significant difference between the body and brain temperature. [Hint: Choose the most appropriate t-test among the one-sample, paired, or two-sample t-test]
gun2 <- c(1,0,0,0,0,0)
{| class="wikitable"
deadman <-0 ;idiot1win <-0; idiot2win <- 0; nooned <- 0; total <- 0
|-
while(deadman < 1000){
! Ostrich !! Body Temperature !! Brain Temperature
  idiot1shot <-sample(gun)
|-
  if (length(which(idiot1shot[1] == 1))){
| 1 || 38.51 || 39.32
    deadman <- deadman + 1
|-
    idiot1win <- idiot1win + 1
| 2 || 38.45 || 39.21
  }else{
|-
      idiot2shot <-sample(gun2)
| 3 || 38.27 || 39.2
      if (length(which(idiot2shot[1] == 1))){
|-
        deadman <- deadman +1
| 4 || 38.52 || 38.68
        idiot2win <- idiot2win +1
|-
      } else {nooned <- nooned + 1}
| 5 || 38.62 || 39.09
  }
|-
  total <- total + 1
| 6 || 38.18 || 38.94
  }
|}
 
* (5 pts) Could mosquitoes detect odor of what people drink? The following data points are the relative activation time (the smaller the quicker) of mosquitoes flying toward volunteers, divided into beer- and water-drinking groups:
 
** Beer group: 0.36, 0.46, 0.06, 0.18, 0.25, 0.18, -0.06, -0.14, 0.12, 0.39, 0.17, -0.16, -0.05, 0.19, 0.25, 0.31, 0.17, -0.03, 0.23, -0.03, 0.26, 0.30, 0.11, 0.13, 0.21
deadman
** Water group: 0.04, 0, -0.08, -0.12, 0.201, -0.039, 0.10, 0.041, 0.02, 0.236, 0.05, 0.097, 0.122, -0.019, 0.021, -0.08, -0.165, -0.28
idiot1win
# State the null and alternative hypotheses
idiot2win
# Make a data frame and visualize with a boxplot combined with a stripchart
nooned
# Test normality of data points by showing qqnorm() and qqline() plots (separate for two column)  
total
# Test if there is significant difference between the two groups using a t-test.
 
p <- idiot1win/1000
p*100
takes2kill <- deadman/total
takes2kill*100
idiot1shot
idiot2shot
</syntaxhighlight>
 
* By Weigang
<syntaxhighlight lang="bash">
rounds <- sapply(1:1000, function(x) {
  alive <- 1;
  round <- 0;
  while(alive == 1){
    spin <- sample(c(0,0,0,0,0,1));
    round <- round + 1;
    if(spin[1] == 1) { alive <- 0 }
  }
  return(round)
})
prob.a.live <- length(which(rounds %% 2 == 0))
barplot(table(rounds)/1000, xlab="num rounds", ylab="Prob", main = "Sudden Death with a 6-slot gun (sim=1000)", las=1)
</syntaxhighlight>
|}
|}


===March 30. Exam 2===
==Feb 17, 2017 "Birthday" (Due 2/24/2017)==
 
[[File:B-day.png|thumbnail]]
===April 6. Analyzing Proportions===
* Problem: What is the probability NONE of the N people in a room sharing a birthday?
* Chapter 7 & 8
# Randomly select N individuals and record their B-days
# Count the B-days NOT shared by ANY two individuals
# Repeat (for each N) 1000 times, and obtain probability by averaging the previous counts (i.e., divided by 1000)
# Vary N from 10 to 100, increment by 10
# Plot probability of no-shared B-Day (Y-axis) versus N (x-axis), with either a stripchart or boxplot, or both
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #7. Due 4/19 (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
# (Analysis of frequency data. 5 pts) A Siena College Poll has been conducted between April 6-11, 2016 by telephone calls. Based on a sample of 538 likely Democratic primary voters in New York State (which votes next Tuesday on 4/19/2016), the poll concludes that "Vermont Senator Bernie Sanders has narrowed the lead against former New York Senator Hillary Clinton among likely Democratic voters, however, she still has a double digit lead 52-42 percent".
* By Roy
## What are the estimates of proportion of voters supporting Clinton and Sanders, respectively? Name the two variables <code>p.clinton</code> & <code>p.sanders</code>
<syntaxhighlight lang="bash">
## What are the standard errors (i.e., "margins of error" of the pool) of these two estimates? Use the formula <code>SE[p]=sqrt(p*(1-p)/n)</code> 
N <- 0
## What are the 95% confidence intervals for these two estimates? Use the approximate formula <code>p-2*SE[p], p+2*SE[p]</code>
samples <- 0;
## Assuming that the number of Sanders supporters is 226 out of 538 in this pool, use the <code>binom.test()</code> function to test the null hypothesis that the proportion of Sanders supporter is <code>p=0.5</code>. Could the null hypothesis be rejected at p value of 0.05?
prob.nodub <-0;
## Would you predict based on this pool that Clinton would win the New York Primary?
for(j in 10:100){
# (Test of goodness-of-fit. 5 pts) Variation in flower color of a plant species is determined by a single gene, with ''RR'' individuals being red, ''Rr'' individuals pink, and ''rr'' individuals white. The expected color ratio of crossing F1 individuals is 1:2:1.
  counting.no.dups <- 0;
## In one cross experiment, the results were 10 red, 21 pink, and 9 white-flowered offspring. Do these results differ significantly from the expected frequencies (at 5% level of significance)? Create a data frame with observed and expected counts as the two columns. Calculate the chi-square value and degree of freedom. Obtain p value by using the <code>pchisq()</code> function
  test <- for(i in 1:1000){
## Validate your above test using the <code>chisq.test(x=c(observed counts), p=c(0.25, 0.5, 0.25))</code> function. Save test results and show expected counts
  bdays <- sample(seq(as.Date('1990/01/01'), as.Date('1990/12/31'), by="day"), N, replace=T)
## In another, larger experiment, 1000 red, 2100 pink, and 900 white flowers were observed. Do these results differ significantly from the expected ratio?
  dups <- duplicated(bdays, incomparables = FALSE)
## How would you explain the difference between the two results?
  ch <-length(which(dups == TRUE))
    if(ch==0){
      counting.no.dups <- counting.no.dups +1
    }
    fine <- (counting.no.dups/1000)
  }
  print(N)
  print(fine)
  N <- N + 1
  samples[[j]] <- N
  prob.nodub[[j]] <- fine
}
plot(samples,prob.nodub, main="Birthday Simulation", xlab = "Sample Size", ylab = "Probability of no Duplicates", las =1)
</syntaxhighlight>
* By weigang
<syntaxhighlight lang="bash">
days <- 1:365;
 
find.overlap <- function(x) { return(length(which(table(x)>1))) }
 
output <- sapply(1:100, function(x) { # num of people in the room
  ct.no.overlap <- 0;
  for (k in 1:100) {
    bdays <- sample(days, x, replace = T);
    ct <- find.overlap(bdays);
    if (!ct) { ct.no.overlap <- ct.no.overlap + 1}
  }
  return(ct.no.overlap);
})
plot(1:100, output/100, xlab="Group size", ylab = "Prob (no shared b-day)", las=1, main="B-day (sim=100 times)", type="l")
abline(h=0.5, lwd=2, col=2, lty=2)
abline(v=seq(0, 100, 5), col="gray")
</syntaxhighlight>
|}
|}


===April 13. No Class (Spring Break)===
==Feb 10, 2017 "Dating" (Valentine's Day Special; Due 2/17/2017)==
===April 20. No Class (Monday Schedule)===
[[File:Dating.png|thumbnail]]
===April 27. Contingency Analysis===
* Source: Paul Nahin (2008), "Digital Dice". Problem #20: "An Optimal Stopping Problem"
* Chapter 9.
* Problem: What is the optimal time point when one should stop dating more people and settle on a mate choice (and live with the decision)
* Lecture Slides: [[File:Part-3-frequency-2017.pdf|thumbnail]]
* Your best strategy is to date an initial sample of N individuals, rejecting all, and marry the next one ranked higher than any of your N individuals. The question is what is the optimal number for N.
# The problem could be investigated by simulating a pool of 10 individuals, ranked from 1-10 (most desirable being 1) and then take a sample of N
# You may only date one individual at a time
# You cannot go back to reach previously rejected candidates
# Simulate N from 0 to 9 (zero means marrying the first date, a sample size of zero)
# For each N, obtain the probability of finding the perfect mate (i.e., ranked 1st) by running simulation 1000 times
# Plot barplot of probability versus sample size N.
# Expected answer: N=4
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #8. Due 5/4 (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
The following table shows results of genotype counts in "Taster" and "non-Taster" individuals.
* By Weigang
<center>
<syntaxhighlight lang="bash">
{| class="wikitable"
pick.candidate <- function(min, array) {
|-
  for (i in 1:length(array)) {
!  !! CC !! CT !! TT
    if (array[i] < min) {
|-
      return(array[i])
| Taster || 44 || 20 || 40
    } else {next}
|-
  }
| Non-Taster || 19 || 10 || 39
  return(0) # No 1. has been sampled and rejected
|}
}
</center>
candidates <- 1:10;
# Hardy-Weinberg Analysis
output <- sapply(0:9, function(x) {
## (1 pt) Calculate overall genotype counts (regardless of phenotype)
  ct <- 0;
## (1 pt) Test the observed genotype counts against Hardy-Weinberg expectations (consult lecture slides)
  for(k in 1:1000) {
## (2 pt) Perform a chi-square test using observed counts and expected proportions. Is the result significant? State for each genotype if it is under- or over-presented. What assumptions of Hardy-Weinberg equilibrium are not met, if your test result is significant?
    if (x==0) { # no sample, marry the 1st guy
# Genotype-phenotype association analysis
      sampled <- sample(candidates, 1);
## (1 pt) State the null & alternative hypotheses
      if (sampled == 1) {ct <- ct+1}
## (1 pt) Create a matrix called "taster.genotypes" and display the data as a mosaic plot. Make sure that the explanatory variable is on the ''X''-axis
    } else {
## (2 pts) Test the genotype-phenotype association with a chi-square test, with simulated p value. Conclude if there is significant association between the genotypes and phenotype
      sampled <- sample(candidates, x);
## (2 pts) Save the above test result as "taste.chisq". Show observed & expected counts and identify over- and under-represented for each genotype/phenotype combination
      not.sampled <- candidates[-sampled];
      not.sampled <- sample(not.sampled);
      if (pick.candidate(min =  min(sampled), array = not.sampled) == 1) {ct <- ct+1}
    }
  }
  return(ct);
})
barplot(output/1e3, names.arg = 0:9, xlab = "number of sampled dates", las=1, main = "Optimal stopping for dating (N=10 candidates)", ylab = "Prob(marrying No.1)")
</syntaxhighlight>
|}
|}


===May 4. Exam 3===
==Feb 3, 2017 "US Presidents" (Due 2/10/2017)==
* Review lecture slides (part 3)
[[File:Sim-presidents.png|thumbnail]]
* Review two assignments
* Download [[File:Presidents.txt|thumbnail]]: 1st column is the order, 2nd column is the name, the 3rd column is the year of inauguration; tab-separated
 
* Your job is to create an R, Perl, or Python script called “us-presidents”, which will
===May 11. Correlation===
# Read the table
* Chapter 16. [http://whitlockschluter.zoology.ubc.ca/r-code/rcode16 R Data & Code on Publisher's Website]
# Store the original/correct order
* Lecture slides: [[File:Part-4-correlation-spring-2017.pdf|thumbnail]]
# Shuffle/permute the rows and record the new order
* In-class Exercise 1. Aggression behavior in birds. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels
# Count the number of matching orders
{| class="wikitable"
# Repeat Steps 3-4 for a 1000 times
|-
# Plot histogram or barplot (better) to show distribution of matching counts
! Variable !! Values
# Hint: For R, use the sample() function. For Perl, use the rand() function.
|-
| Number of visits by non-parent adults || 1,7,15,4,11,14,23,14,9,5,4,10,13,13,14,12,13,9,8,18,22,22,23,31
|-
| Future aggressive behavior || -0.8,-0.92,-0.8,-0.46,-0.47,-0.46,-0.23,-0.16,-0.23,-0.23,-0.16,-0.10,-0.10,0.04,0.13,0.19,0.25,0.23,0.15,0.23,0.31,0.18,0.17,0.39
|}
* In-class Exercise 2. Inbreeding in wolfs. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels; (3) Calculate correlation coefficient and test its significance by using the <code>cor.test()</code> function
{| class="wikitable"
|-
! Variable !! Values
|-
| Inbreeding coefficient between mating pairs || 0,0,0.13,0.13,0.13,0.19,0.19,0.19,0.22,0.24,0.24,0.24,0.24,0.24,0.24,0.25,0.27,0.30,0.30,0.30,0.30,0.36,0.37,0.40
|-
| Number of pups surviving first winter || 6,6,7,5,4,8,7,4,3,3,3,3,2,2,6,3,5,3,2,1,3,2,3
|}
* In-class Exercise 3. Miracles of memory. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels; (3) Run a rank correlation using <code>cor.test(method="spearman")</code> function
{| class="wikitable"
|-
! Variable !! Values
|-
| Years elapsed || 2,5,5,4,17,17,31,20,22,25,28,29,34,43,44,46,34,28,39,50,50
|-
|Impressive score || 1,1,1,2,2,2,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5
|}
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #9. Due 5/18 (Finalized)
! scope="col" style="width: 300px;" | Submitted Codes
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
(Note: Lecture slides has been posted. For help, refer to the slides and [http://whitlockschluter.zoology.ubc.ca/r-code/rcode16 R Data & Code on Publisher's Website])
* By Mei
<syntaxhighlight lang="bash">
pres.list <- lapply(1:1000, function(x) pres[sample(nrow(pres)),])
cts <- sapply(pres.list, function(x) {
  ct.match <- 0;
  for (i in 1:45){
    if (pres$order[i] == x[i,1]){
      #cat(as.character(x[i,2]), x[i,1], "\n")  #as.character to avoid the factor info
      ct.match <- ct.match + 1;
      #cat(ct.match)
    }
  }
  ct.match #return
})
barplot(table(cts),xlab = "Number of matches per shuffle", border = "hotpink", col = "pink", ylab = paste("Frequency total of:",length(pres.list)), main = "US Presidents")
</syntaxhighlight>
* By John
<syntaxhighlight lang="python">
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt


Question 1 (6 pts). In their study of hyena laughter (or "giggle"), Mathevon et al (2010) asked whether sound spectral properties of the giggles are associated with age. Perform correlation analysis using data in the following table:
df = pd.read_table("presidents.txt", names=["num", "name", "presidency"])
{| class="wikitable"
|-
| Age (years) || 2,2,2,6,9,10,13,10,14,14,12,7,11,11,14,20
|-
| Frequency (Hz) || 840,670,580,470,540,660,510,520,500,480,400,650,460,500,580,500
|}
* Identify explanatory and response variables and their data types.
* Make a data frame and test normality of both variables
* Make a scatterplot including axes labels and a main title
* Calculate correlation coefficient between the two variables and show test results
* State null and alternative hypotheses of the correlation test
* Conclude if there is a correlation between the age and "giggle", if it is statistically significant, and whether positive or negative if the relationship is significant.


Question 2(4 pts). A mutation in the <i>SLC11A1</i> gene in humans causes resistance to tuberculosis. Barnes et al (2011) examined the frequency of the resistant allele in different towns in Europe and Asia and compared it to how long humans had been settled in the site ("duration of settlement"):
name_list = list(df.name)
{| class="wikitable"
|-
| Duration of settlement (years) || 8010,5260,4735,4010,3710,2810,2730,2310,2110,1955,1910,1300,378,194,130,110,91
|-
| Allele frequency || 0.99,1.0,0.948,0.885,0.947,0.854,1.0,0.769,0.9956,0.979,0.865,0.922,0.821,0.842,0.734,0.766,0.772
|}
* Make a data frame and test variable normality.
* Examine data with a scatterplot (with axes labels).
* Since the relationship appears curvilinear, use Spearman's rank correlation to test the association
* Show test results and state your conclusions (if there is a correlation, if it is significant, and whether the relation is positive or negative if significant).
|}


===May 18. Regression===
# create a list to store matches after each shuffle
* Chapter 16. [http://whitlockschluter.zoology.ubc.ca/r-code/rcode17 R Data & Code on Publisher's Website]
shuffle_record = []
* In-class Exercise 1. Lion noses. Linear regression with regression line & confidence intervals
* In-class Exercise 2. Prairie Home Companion. Linear regression with test of non-zero slope
* In-class Exercise 3. Iris pollination. Square-root transformation
* In-class Exercise 4. Iron and phytoplankton growth. Non-linear regression
* In-class Exercise 5. Guppy cold death. Logistic regression
* Teacher's Evaluation
** Students can complete them now in 3 simple steps:
*** Visit [http://www.hunter.cuny.edu/te www.hunter.cuny.edu/te] OR [http://www.hunter.cuny.edu/mobilete www.hunter.cuny.edu/mobilete] (for smartphones)
*** Sign in with your net ID and net ID password (forgot your password? Click here: https://netid.hunter.cuny.edu/verify-identity)
*** Complete the evaluation for your instructor(s)
** Notes to students:
*** Your responses are completely anonymous, and that instructors can only see results after grades are released.
*** Teacher evaluations serve a number of important functions such as: improving classes by providing instructors feedback of their teaching; and, serving as supporting documentation in a faculty’s reappointment, tenure and promotion.
*** Teacher evaluations also help the student make decisions about what courses and instructors are right for them.  Please let the students know that the teacher evaluation results are readily accessible to them at www.hunter.cuny.edu/myprof.
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Assignment #10. Due In-Class or 5/25
|- style="background-color:powderblue;"
| (10 pts) Using the "Prairie Home Companion" data set to answer the following questions
* Test the normality of response variable using <code>qqnorm</code> and <code>qqline</code>
* Make scatter plot with axes labels
* Log-transform the response variable and make a new scatter plot
* Perform regression analysis using <code>lm</code>
* What is the regression formula (intercept and slope? significance of each?)
* Calculate confidence interval of the population slope
* Obtain R-squared value and explain its meaning and significance
* Add confidence band line
* Add confidence interval line
|}


===May 25. Final Exam (11:10-1:40; Comprehensive)===
# create a function in which the first argument is the original dataset; second argument is number of shuffles
def shuffler2(original, n):
    record = []
    for i in range(n):
        num = 0
        each_shuffle = {}
        temp = original.reindex(np.random.permutation(original.index)) # do shuffling for each
        compare = original.num == temp.num
        matched_df = original.ix[compare[compare == True].index]
        for i in matched_df.index:
            each_shuffle[i] = matched_df.name[i]
        shuffle_record.append(each_shuffle)
        try:
            num = compare.value_counts()[1]
        except:
            pass
        record.append(num)
    return(record)
result2 = shuffler2(df, 1000)


===May 31. Grades submitted to Registrar Office===
plt.hist(result2, color="yellow")
plt.title("Histgram for President Data")
plt.show()
</syntaxhighlight>
* By Weigang
<syntaxhighlight lang="bash">
p <- read.table("Presidents.txt", sep="\t", header=F)
colnames(p) <- c("order", "name", "inaug.year")
# Use "sapply" or "lapply" for loops: no need to pre-define a vector to store results
p.sim <- sapply(1:10000, function (x) {
  length(which(sample(p$order) == p$order))
  })
barplot(table(p.sim)/1e4, las=1)
p.exp <- rpois(1e4, 1) # draw 10000 Poisson random deviates
mp <- barplot(table(p.exp)/1e4, las=1, xlab = "Num of matching presidents") # mid-point on x-axis
lines(mp[1:7], table(p.sim)/1e4, type="b", col=2) # add a line (Poisson-expected) to the barplot (simulated)
legend("topright", c("Simulated", "Poisson expectation"), col=1:2, lty=1)
</syntaxhighlight>
|}

Revision as of 17:28, 31 January 2022

Season V. Genes, Memes, and Machines (Spring 2022)

  • A journal club to continue the exploration of the link between evolution & learning.
  • A primer for information theory
  • Data ≠ Information: Scharf (2021). The Ascent of Information: Books, Bits, Genes, Machines, and Life's Unending Algorithms. Amazon link
  • The "It's the song not the singer" (ITSNTS) theory of selection unit: Doolittle & Inkpen (2018). "Processes and patterns of interaction as units of selection: An introduction to ITSNTS thinking", PNAS.
  • Evolution is an AI machine: Stanley, Lehman, and Soros (2017). "Open-endedness: The last grand challenge you’ve never heard of - While open-endedness could be a force for discovering intelligence, it could also be a component of AI itself." O'Reily
  • An algorithmic definition of individuality: Krakauer et al (2020). "The information theory of individuality". Theory BioSci
  • Evolution towards complexification: Krakauer (2011). "Darwinian demons, evolutionary complexity, and information maximization". Chaos.

Season IV. Classification using Machine Learning (Summer 2021)

  • Textbook: Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition. Amazon link
  • Set up Python work environment with mini-conda
  • We will be using Jupyter-notebook (part of mini-codon installation) to share codes

Week 1. MNIST dataset (Chapter 3)

  1. Dataset loading and display (pg 85-87): Jackie, Niemah, and Hannah
  2. Binary classifier
    1. Cross-validation (pg 89-90): Roman, etc
    2. Confusion matrix (pg 90-92):
    3. Precision and recall (pg 92-97)
    4. ROC curve (pg. 97-100)
  3. Multiclass classification (pg 100-108): Brian, etc

Week 2. K-means clustering (Chapter 9)

  1. 2D simulated dataset
  2. Image recognition
  3. MNIST dataset

Week 3. Exercises (pg.275-276, Chapter 9)

  1. Exercise 10: Facial recognition (Olivetti faces dataset, with k-means)
  2. Exercise 11. Facial recognition (semi-supervised learning

Notes on Origin of Life (Spring 2020)

  1. Attie et al (2019). Genetic Code optimized as a traveling salesman problem
  2. GC Origin of Life Seminar (2/19/2021)
  3. Pressman et al (2015). Review: RNA World (ribozyme as origin of genetic code)
  4. A perspective: Miikkulainen, R., Forrest, S. A biological perspective on evolutionary computation. Nat Mach Intell 3, 9–15 (2021)

Season III. Summer 2018 (Theme: Evolutionary Computing)

Week 1. Introduction & Motivating Examples

  1. Genetic Arts
    1. Go to picbreeker website & evolve using "branch" method
    2. Evolve 3D art: Endless Forms
    3. CPPN-NEAT Algorithm: Compositional Pattern Producing Networks (CPPNs)-NeuroEvolution of Augmenting Topologies (NEAT); PicBreeder paper; or a PDF version
  2. NeuroEvolution (by DeepMind team)
  3. Robotic snake (and other soft robots)
    1. A controller built with physics laws difficult (too many parameters)
    2. Simulation with evolutionary computing:
      1. (Genotype) A list of 13 commands (one for each segment; each being a neural net, with 25 inputs and one output of joint angles)
      2. (Phenotype) Fitness function: total displacement
      3. Algorithm: MAP-Elites; Nature paper
      4. Approach: training with simulated data
  4. Robotic Knightfish
    1. Youtube demo
    2. Algorithm: CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
      1. Genotype: 15 variables (sinusoidal wave function or Fourier Series)
      2. Phenotype/Fitness: speed
    3. Implementation: DEAP
  5. Compositional Protein design (CPD)
    1. Genotype: side-chain configuration determined by rotamer library
    2. Phenotype/Fitness: Rosetta energy function & functional (e.g., binding affinity)
    3. Fitness landscape: each node is a protein structure, each edge represent connections/relatedness
  6. Simulation-based optimization: Problem Sets
  7. Self-evolving software(Genetic Programming)

Week 2. Toy Problem: OneMax Optimization

  1. Problem: Create a list of L random bits (0 or 1). Evolve the list until the fitness reaches the maximum value (i.e., contains only 1's)
  2. Neutral Evolution:
    1. Create a vector of L random bits (e.g, L=20). Hint for creating a random vector of 0's and 1's in R: ind<-sample(c(0,1), prob=c(0.5.0.5), replace=T, size=20). Biologically, this vector represents a single haploid genome with 20 loci, each with two possible alleles (0 or 1).
    2. Create a population of N=100 such individuals. Hint: creating a list of vectors in R: pop <- lapply(1:100, function(x){<insert sample() function above>})
    3. For each generation, each individual reproduces 10 gametes with mutation (with the probability for bit flip: mu = 1/L = 0.1). Hint: write a mutation function that takes an individual as input and outputs a mutated gamete. Use the "broken stick" algorithm to implement mutation rate: cutoff <- runif(1); ifelse(cutoff <= mu, flip-bits, no-flip)
    4. Take a random sample of N=100 gametes into the next generation & repeat above
    5. Plot mean, minimum, and maximum fitness over generations
    6. Plot diversity statistics over generation (including average allelic heterozygosity per locus as well as haplotype heterozygosity). Hint: write two functions for these heterozygosity
  3. Add natural selection (proportional scheme)
    1. Individuals reproduce with fitness proportional to the total number of 1's.
    2. Iterate until a population contains at least one individual with all 1's
    3. Plot mean, minimum, and maximum fitness over generations
    4. Plot diversity (expectation: decreasing)
  4. Add natural selection (tournament scheme)
    1. Randomly selecting n=5 individuals and allow the fittest one to make gametes
    2. Plot mean, minimum, and maximum fitness over generations (Expectation: faster optimization)
    3. Plot diversity (expectation: decreasing fasters)
  5. Add crossover
    1. Hint: write a crossover function
    2. Does it reach optimization faster?
  6. Code submissions
    1. Python Notebook by John
    2. R code by Weigang
    3. R code by Panlasigui
    4. Python Notebook by Yinheng
    5. RPub by Desiree

Week 3. Python & R Packages for evolutionary computation

Week 4. Multiplex Problem

Week 5. Genetic Programming

Season II. Summer 2017 (Theme: Machine Learning)

Week 1. Introduction & the backprop algorithm

Iris-box3.png
Iris-box4.png
  1. Problem: Classification/Clustering/Predication of flower species (a total of 3 possible species in the sample data set) based on four phenotypic traits/measurements
  2. The "iris" data set: exploratory data analysis with visualization & descriptive statistics: data("iris"); plot(), summary(), qqnorm(); qqline(); hist(); normalization with scale() (Roy)
  3. Mathematics of backpropagating errors to neural connections (Oliver)
    1. Objective/optimization function measuring difference between target (t, expected) and neural activity y: G=Sum{t*log(y)+(1-t)log(1-y)} (this is known as the "cross-entropy" error function; the other alternative is "minimal squared error (MSE)"), which has the gradient in the simple form of g=-(t-y)x, where x is the input. The objective function is minimized when weights are updated by the gradient. Error-minimization by MSE has similar effects but harder to calculate.
    2. Learning algorithm is presented

Week 2. Traditional approaches to multivariate clustering/classification

  1. Dimension reduction with Multidimensional Scaling cmdscale() Mei's rNoteBook.
  2. Dimension reduction with Principal Component Analysisprincomp(). Sipa's rNoteBook
  3. Multivariate clustering with Hierarchical Clustering hclust() (Saymon)
  4. Multivariate clustering with k-means kmeans() Roy's rNoteBook
  5. Classification based on logistic regression glm(), linear discriminatory analysis lda(), & k-nearest neighbor library(class); knn() (John)
  6. Modern, non-linear classifiers: Decision Trees (DT), Support Vector Machines (SVM), and Artificial Neural Networks (ANN) (Brian)

Week 3. Single-neuron classifier

  1. Algorithm:
    Ml-image-1a.jpg
    1. Read input data with N=150 flowers. Reduce to a matrix x with two traits (use trait 1 & 3) and two species (use rows 51-150, the last two species, skip the first species [easy to separate]) for simplicity. Create a target vector t <- c(rep(0,50),rep(1,50)) indicating two species
    2. Initialize the neuron with two random weights w<-runif(2, 1e-3, 1e-2) and one random bias b<-runif(1)
    3. Neuron activation with k=2 connection weights: a=sum(x[k] * w[k])
    4. Neuron activity/output: y=1/(1+exp(-a-b)) (This logistic function ensures output values are between zero and one)
    5. Learning rules: learning rate eta=0.1, backpropagate error ("e") to get two updated weights (for individual i, feature k): e[i]=t[i]-y[i]; g[k,i]= -e[i] * x[k,i]; g.bias[i] = -e[i] for bias; Batch update weights & bias: w[k]=w[k] - eta * sum(g[k,i]); b = b - eta * sum(g.bias[i]) (same rule for the bias parameter b); Repeat/update for L=1000 epochs (or generations)
    6. Output: weights and errors for each epoch
  2. Evaluation:
    1. Plot changes of weights over epoch
    2. Use the last weights and bias to predict species
    3. Compare prediction with target to get accuracy
    4. Plot scatter plot (x2 vs x1) and add a line using the weights & bias at epoch=1,50,100,200,500, 1000: a=w1*x1 + w2*x2 + b; a=0. The lines should show increasing separation of the two species.
  3. Code submissions
    1. Roy: rPubs notebook
    2. John: Python code on github
    3. Mei: rPubs notebook
    4. Brian: rPubs notebook
    5. Weigang: rPubs notebook
  4. Questions for future exploration
    1. How to avoid over-fitting by regularization (a way to penalize model complexity by adding a weight decay parameter alpha)
    2. Bayesian confidence interval of predictions (with Monte Carlo simulation)
    3. Limitations: equivalent to PCA (linear combination of features); adding a hidden layer generalize the neural net to be a non-linear classifier

Week 4. Single-layer, multiple-neuron learner

Multiple-neuron-learner.png
  1. Predict all three species (with "one-hot" coding) using all four features: e.g., 100 for species 1, 010 for species 2, and 001 for species 3. Create 3 neurons, each one outputting one digit.
  2. Use three neurons, each accepts 4 inputs and output 1 activity
  3. Use softmax to normalize the final three output activities
  4. Code submissions
    1. John:
      1. reference this webpage
      2. Python code
      3. TensorFlow code
    2. Roy: rpubs notebook
    3. Mei: rPubs Notebook
    4. Weigang: rPubs notebook
  5. Challenge: Implementation with TensorFlow, an Open Source python machine learning library by Google Deep Mind team. Follow this example of softmax regression
  6. A biological application for the summer: identify SNPs associated with biofilm/swarming behavior in Pseudomonas

Week 5. Multi-layer ("Deep") Neuron Network

Iris-nnet.png
An investigation of under-fitting (at N=1) & over-fitting (at N=3 nodes). N=0 implies linear fitting.
  1. Input layer: 4 nodes (one for each feature)
  2. Output layer: 3 nodes (one for each species)
  3. Hidden layer: 2 hidden nodes
  4. Advantage: allows non-linear classification; using hidden layers ("convoluted neural net") is able to capture high-order patterns
  5. Implementation: I'm not going to hard-code the algorithm from scratch. The figure was produced using the R package with the following code: library(nnet); library(neuralnet); targets.nn <- class.ind(c(rep("setosa",50), rep("versicolor",50), rep("virginica",50))) # 1-of-N encoding; iris.net <- neuralnet(formula = setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = training.iris, hidden = 2, threshold = 0.01, linear.output = T); plot(iris.net, rep="best"). It achieved 98.0% accuracy.
  6. Deep neural net allows non-linear, better fitting, but we don't want over-fitting by adding more hidden layers (or more neurons in the hidden layer). Identify under- and over-fitting with the following procedure:
    1. Randomly sample 100 as training set and the remaining as target. Repeat 100 times
    2. For each sample, plot accuracy for the training set, as well as accuracy for the target
    3. Find the point with the right balance of under- and over-fitting

Week 6. Unsupervised Neural Learner: Hopfield Networks

  1. Hebbian model of memory formation (MacKay, Chapter 42). Preparatory work:
    1. Construct four memories, each for a letter ("D", "J", "C", and "M") using a 5-by-5 grid, with "-1" indicating blank space, and "1" indicating a pixel. Flatten the 5-by-5 matrix to a one-dimensional 1-by-25 vector (tensor).
    2. Write two functions, one to show a letter show.letter(letter.vector), which draws a pixel art of letters (print a blank if if -1, a "x" if 1), another to mutate the letter mutate(letter.vector, number.pixel.flips)
  2. Hopfield Network
    1. Store the four memories into a weight matrix, which consists of symmetric weights between neurons i and neuron j, i.e., w[i,j] = w[j,i]. We will use a total of 25 neurons, one for each pixel.
    2. First, combine the four vectors (one for each letter) into a matrix: x <- matrix(c(letter.d, letter.j, letter.c, letter.m), nrow = 4, byrow = T)
    3. Second, calculate weight matrix by obtaining the outer product of a matrix multiplication: w <- t(x) %*% x. Set diagonal values to be all 0's (i.e., to remove all self-connections): for (i in 1:25) { w[i,i] = 0 }. This implements Hebb learning, which translates correlations into strength of connections quantified by weights: large positive weights indicate mutual stimulation (e.g., 1 * 1 = 1 [to wire/strengthen the connection of co-firing neurons, and ...], -1 * -1 = 1 [to wire/strengthen the connection of co-inhibitory neurons as well]), large negative weights indicate mutual inhibition (e.g., 1 * -1 = -1 [to unwire/disconnect oppositely activated neurons]), and small weights indicate a weak connection (e.g., 0 * 1 = 0 [to weaken connections between neurons with uncorrelated activities, but do not unwire them]). (0, 1, and -1 being values of neuron activities)
    4. Third, implement the learning rule by updating activity for each neuron, sequentially (asynchronously): a[i] <- sum(w[i,j] * x[j])
    5. Iterate the previous step 1-5 times, your code should be able to (magically) restore the correct letter image even when the letter is mutated by 1-5 mutations in pixels. This exercise simulates the error-correction ability of memory (e.g., self-correcting encoding in CDs, a neon-light sign missing a stroke, or our ability to read/understand/reconstruct texts with typos, e.g., you have no problem reading/understanding this sentence: "It deosn’t mttaer in what order the ltteers in a word are, the olny iprmoatnt thing is taht the frist and lsat ltteer be in the rghit pclae.").
    6. Expected capacity of a Hopfield Network: number_of_memories = 0.138 * number_of_neurons. In our case, 25 (neurons) *0.138 = 3.45 memorized letters. So the network/memory fails if we squeeze in one additional letter (try it!).
  3. Code submissions
    1. Roy: rPubs Notebook
    2. John: Python Code on Github
    3. Weigang: rPubs Notebook
  4. Potential biological applications: genetic code optimization; gene family identification

Week 7. Biological Applications

A conceptual map for choosing ML algorithms:

Conceptual map from SciKit-Learn website
  • The Python SciKit Learn framework may be tool of choice (instead of R). See these nice examples
  • Bayesian Network (using e.g., R Package bnlearn) to identify cell-signaling networks. Sache et al (2005). Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data
  • Identify proteins associated with learning in mice (Classification & Clustering)UCI Mice Protein Expression Dataset
  • Predict HIV-1 protease cleavage sites (Classification): UCI HIV-1 Protease Dataset
  • Lab project 1. Identify SNPs in 50 c-di-GMP pathway genes that are associated with c-di-GMP levels, swarming ability, and biofilm ability in 30 clinical isolates of Pseudomonas aeruginosa. Approach: supervised neural network (with regularization)
  • Lab project 2. Identify genetic changes contributing to antibiotic resistance in 3 cancer patients. Approach: whole-genome sequencing followed by statistical analysis
  • Lab project 3. Simulated evolution of genetic code. Approach: Multinomial optimization with unsupervised neural network

Summer Project 1. Systems evolution of biofilm/swarming pathway (with Dr Joao Xavier of MSKCC)

Fig.1 .Simulated CDG-correlated SNPs. t-test results: (1) cor=0.8 (strong), t=-5.1142, df=95.596, p=1.619e-06; (2) cor=0.5 (medium), t=-4.7543, df=85.796, p=7.953e-06; (3) cor=0.2 (weak), t=-0.94585, df=79.28, p= 0.3471.
Fig.2. Simulated tree-based SNPs. Generated with the APE function: replicate(10, rTraitDisc(tr, states = c(0,1), rate = 100, model = "ER"))
  1. Acknowledgement: NSF Award 1517002
  2. Explanatory variable (genotypes): Whole-genome data (of ~30 clinical isolates & many experimentally evolved strains) as independent variables
  3. Explanatory variable (genotypes): ~50 genes related to cyclic-di-GMP synthesis and regulation (~8000 SNPs, ~1700 unique, ~5-8 major groups)
  4. Response variables (phenotypes): biofilm size, swarming size, antibiotic sensitivity profile, metabolomics measurements, c-di-GMP levels
  5. Sub-project 1: Database updates (Usmaan, Edgar, Christopher; continuing the work of Rayees and Raymond in previous years)
    1. c-di-GMP pathway SNPs in the new table "cdg_snp"
    2. c-di-GMP levels in "phenotype" table
    3. capture fig orthologs (in progress)
    4. capture matebolomics data with a new table (in process)
      1. a heatmap of deviation from mean (among strains of the same metabolite)
      2. a volcano plot (p values based on t-test from group mean among strains of the same metabolite)
  6. Sub-project 2. Supervised learning for predicting genes, SNPs associated with phenotypes
    1. Advantages over traditional regression analysis: ability to discover non-linear correction structure
    2. Challenges:
      1. over-fitting (we have only ~30 observations while thousands of SNPs, "curse of dimensionality")
      2. the "effective" sample size is further discounted/reduced by phylogenetic relatedness among the strains.
    3. Single neuron, two-levels; by SNP groups (using hclust()); by PCA (linear combination of SNPs); or by linkage groups (looking for homoplasy) (Mei), or by t-SNE (as suggested by Rayees)
    4. Three neurons, three-levels; by gene (Roy)july-14_nnet_cidigmp.R
    5. To Do: (1) predict continuous-value targets; (2) add hidden layers; (3) correct for phylogenetic auto-correlation
  7. Sub-project 3. Generate simulated data (with Choleski Decomposition) (Weigang; continue the work in previous year by Ishmere, Zwar, & Rayees)

Fig.1. Simulated SNPs associated with cdg levels (w/o tree) inspired by this algorithm

# simulate SNP-cdg correlation
r <- 0.2 # desired correlation coefficient
sigma <- matrix(c(1,r,r,1), ncol=2) # var-covariance matrix
s <- chol(sigma) # choleski decomposition
n <- 100 # number of random deviates (data points)
z <- s %*% matrix(rnorm(n*2), nrow=2) # 100 correlated normally distributed deviates with cor(x,y)=r
u <- pnorm(z) # get probabilities for each deviates
snp.states <- qbinom(u[1,], 1, 0.5) # discretize the 1st vector of probabilities into 0/1 with Bernoulli trial
idx.0 <- which(snp.states == 0); # indices for "0"
idx.1 <- which(snp.states == 1); # indices for "1"
# boxplots with stripcharts
boxplot(u[2,] ~ snp.states, main="cor=0.2", xlab="SNP states", ylab="CDG level")
stripchart(u[2,] ~ snp.states, vertical=T, pch=1, method="jitter", col=2,  add=T)

Fig.2. Simulated SNPs associated with strain phylogeny

# Simulate SNPs on a tree
tr <- read.tree("cdg-tree-mid.dnd") # mid-point rooted tree
X <- replicate(10, rTraitDisc(tr, states = c(0,1), rate = 100, model = "ER"))
id <- read.table("cdg.strains.txt3", row.names = 1, sep="\t")
par.tr <- plot(tr, no.margin = T, x.lim = 0.03, show.tip.label = F)
text(rep(0.013,30), 1:30, id[tr$tip.label,1], pos = 4, cex=0.75)
text(rep(0.02,30), 1:30, snps[tr$tip.label], pos=4, cex=0.75)
add.scale.bar()
abline(h=1:30, col="gray", lty=2)

To Do/Challenges: (1) simulate multiple SNPs (linear correlation); (2) simulate epistasis (non-linear correlation); (3) simulate phylogenetic correlation

  • Simulation & xgboost code contributed by Mei & Yinheng (January 2018)
#To create a simulated matrix w/ correlated xy variables; NOTE: ONLY 1 correlated x variable
get1Simulated <- function(corco, nstrains, snps){
  r <- corco/10 # desired correlation coefficient
  sigma <- matrix(c(1,r,r,1), ncol=2) # var-covariance matrix
  s <- chol(sigma) #cholesky decomposition
  n <- nstrains
  z <- s %*% matrix(rnorm(n*2), nrow=2)
  u <- pnorm(z[2,])
  snp.states <- qbinom(u, 1, 0.5)

  known <- t(rbind(z[1,],snp.states))
  rand <- pnorm(matrix(rnorm(nstrains*(snps-1)),nrow = nstrains))
  snp.rand<-qbinom(rand,1,0.5)
  x <- cbind(known, snp.rand)

  colnames(x)[1] <- "target"
  colnames(x)[-1] <- paste("feature", seq_len(ncol(x)-1), sep = ".")

  return(x)
}

#Use this function to artificially create 2 x variables that have relationship w/ Y.
getSimulated.chol <- function(snp1, snp2, nstrains, snps){
  n <- nstrains

  btwn <- snp1 * snp2 #this is a suggested value for inter-SNP correlation. the covariance matrix has to be semi-positive infinite in order to carry out the decomposition

  covar.m <- matrix(c(1.0, snp1, snp2, snp1, 1.0, btwn, snp2, btwn, 1.0), nrow = 3) #cor-matrix

  s <- chol(covar.m)
  z <- s %*% matrix(rnorm(n*3), nrow = 3)

  #decretize the 2 variables
  u <- t(pnorm(z[2:nrow(z),]))
  decretize.u <-qbinom(u,1,0.5)

  #generate some random discrete variables and combine w/ the 2 correlated variables
  rand <- matrix(rnorm(nstrains*(snps-2)),nrow = nstrains)
  rand.u <- pnorm(rand)
  snp.rand <- qbinom(rand.u,1,0.5)
  x <- cbind(z[1,], decretize.u, snp.rand)

  dimnames(x) <- list(c(), c("target", paste("feature", seq_len(ncol(x)-1), sep = ".")))

  return(x)
}

#This utilizes xgboost function w/o cross-validating the strains and w/ tuned parameters (we think it's the optimal set) tested by Yinheng's python [https://github.com/weigangq/mic-boost/blob/master/micboost/__init__.py getBestParameters] function. Please make sure to have xgboost installed. 
runXg <- function(x){
  require(xgboost)

  bst <- xgboost(data = x[,2:ncol(x)], label = x[,1], max.depth = 2, eta = .05, gamma = 0.3, nthread = 2, nround = 10, verbose = 0, eval_metric = "rmse")

  importance_matrix <- xgb.importance(model = bst, feature_names = colnames(x[,2:ncol(x)]))

  return(importance_matrix)
}


#Same as above but w/ cross validation.
runXg.cv <- function(x){
  require(caret)
  require(xgboost)

  ind <- createDataPartition(x[,1], p = 2/3, list = FALSE )

  trainDf <- as.matrix(x[ind,])
  testDf <- as.matrix(x[-ind,])

  dtrain <- xgb.DMatrix(data = trainDf[,2:ncol(trainDf)], label = trainDf[,1])
  dtest <- xgb.DMatrix(data = testDf[,2:ncol(testDf)], label = testDf[,1])

  watchlist <- list(train=dtrain, test=dtest)

  bst <- xgb.train(data=dtrain, nthread = 2, nround=10, watchlist=watchlist, eval.metric = "rmse", verbose = 0)

  importance_matrix <- xgb.importance(model = bst, feature_names = colnames(x[,2:ncol(x)]))

  return(importance_matrix)

}

#This function returns a master table with different iterated values that finds the best number of predictor variables (x).
getTable <- function(x, importance_matrix, cor){
  # v.names <- c("feature", "SNPs", "strains", "cor_given", "cor_actual", "gain", "cover", "rank")
  v.names <- c("feature", "SNPs", "strains", "cor_given", "cor_actual", "p.value", "gain", "cover", "rank")
  for (i in 1:length(v.names)){
    assign(v.names[i], numeric())
  }
  featNum <- NULL

  n <- 0
  new_cor <- rep(cor, length(x)/length(cor))
  for (num in 1:length(x)) {
    featNum <- NULL
    imp_matrix <- importance_matrix[[num]]
    x.iter <- x[[num]]
    x.pv <- getPV(x[[num]])

    for (i in 1:2){
      featNum <- c(featNum, grep(paste("feature.",i,"$", sep = ""), imp_matrix$Feature, perl = TRUE))
    }

    n <-  n + 1

    for(i in 1:2){
      f <- featNum[i]
      feature<-c(feature, imp_matrix$Feature[f])
      SNPs <-c(SNPs, 100)
      strains <- c(strains, nrow(x.iter))
      cor_given <- c(cor_given, new_cor[n])
      cor_actual <-c(cor_actual, cor(x.iter[,1],x.iter[,grep(paste(imp_matrix$Feature[f], "$", sep = ""), colnames(x.iter), perl = TRUE)]))
      p.value <- c(p.value, x.pv[,imp_matrix$Feature[f]])
      gain <- c(gain, imp_matrix$Gain[f])
      cover <- c(cover, imp_matrix$Cover[f])
      rank <- c(rank, f)
    }
  }

  x.master <- data.frame(feature, SNPs, strains, cor_given, cor_actual, p.value, gain, cover, rank)

  return(x.master)
}

Summer Project 2. Whole-genome variants associated with multidrug resistance in clinical Pseudomonas & E.coli isolates (with Dr Yi-Wei Tang of MSKCC)

  1. Acknowledgement: Hunter CTBR Pilot Award
  2. Stage 1: Select patients & strains for whole-genome sequencing by MiSeq, based on drug sensitivities (16 isolates from 5 patients were isolated, tested, and selected by April, 2017)
  3. Stage 2: Genome sequencing (FASTQ files generated, 3 replicates for each isolates; done by June 2017)
  4. Stage 3: Variant call
    1. Reference strains identified using Kraken (Roy)
    2. VCF generation using cortex_var (Michele and Hanna, led by John)
  5. Stage 4: Variant annotation
  6. Stage 5: Statistical analysis
  7. Stage 6: Web report

Summer Project 3. Origin of Genetic Code

Tools for testing SGC & evolved codes

  • Shuffle code:
./shuffle-code.pl
    -f <sgc|code-file> (required)
    -s <1|2|3|4|5> (shuffle by 1st, 2nd, 3rd, aa blocks, and all random)
    -p(olarity; default)
    -h(ydropathy)
    -v(olume)
    -e(iso-electricity)
# Input: 'sgc' (built-in) or an evolved code file consisting of 64 rows of "codon"-"position"
# Output: a code file consisting of 64 rows of "codon" - "aa" (to be fed into ./code-stats.pl)
# Usage examples:
./shuffle-code.pl -f 'sgc' -s 5 # randomly permute SGC
./shuffle-code.pl -f 'evolved-code.txt' -p # evolved code, AA assigned according to polarity (default)
./shuffle-code.pl -f 'evolved-code.txt' -s 5 -h # evolved code, AA assigned according to hydrophobicity, shuffled randomly
  • Code statistics
./code-stats.pl
        [-h]    (help)
        [-f 'sgc (default)|code_file']  (code)
        [-s 'pair (default)|fit|path']  (stats)
        [-p 'pol|hydro|vol|iso', default 'grantham']    (aa prop)
        [-i (ti/tv, default 5)]
        [-b: begin codon ('TTT') -q: panelty for 1st (50); -w: panelty for 2nd (100)]   (options for path)
# Input: 'sgc' (built-in) or a code file consisting of 64 rows of "codon" - "aa"
# Output: codon or code statistics
# Usage examples:
./code-stats.pl # all defaults: print single-mutation codon pairs, grantham distance, for SGC
./code-stats.pl -s 'fit' -p 'pol' # print code fitness according to polarity, for SGC (default)
./code-stats.pl -s 'path' # print tour length for SGC
  1. Participants: Oliver, John, Brian
  2. A representation mimicking TSP (see figure at right)
  3. Code to calculate total (mutational, ti/tv) path of SGC?
  4. Code to calculate energy of SGC?
  5. Randomize amino acid assignment and obtain distributions of path & energy
  6. Code to minimize total path length and energy?
  7. How to evolve a robust SGC? mutation bias + polarity/hydropathy + usage

Season I. Spring 2017 (Themes: Dueling Idiots/Digital Dice/Wright Fisher Process)

"Coalescence" (Backward simulation of Wright-Fisher process) (Due May 12, 2017)

Coalescent-output-1.png

We will conclude Spring 2017 season with the coalescence simulation (in summer, we will start experimenting with simulation of systems evolution)

Previously, we simulated Wright-Fisher process of genetic drift (constant pop size, no selection) starting from a founder population and end with the present population. This is not efficient because the program has to track each individual in each generation, although the majority of them do not contribute to the present population.

Coalescence simulation takes the opposite approach of starting from the present sample of k individuals and trace backward in time to their most recent common ancestor (MRCA). Due to the nature of Poisson process, the waiting time from one coalescence event (at time T) to the next one (at time T-1) is exponentially distributed with a mean of (k choose 2) generations. This process is iterated until the last coalescent event.

The classic text on coalescence is Richard Hudson's chapter, which includes (in Appendix) C codes for simulating tree and mutations. Hudson is also the author of ms, the widely used coalescence simulator.

Like the ms 10 1 -T command, your code should output a tree of 10 individuals. [A lot tougher than I thought; can't figure it out in R; so I did it in Perl]. Similarly, you could use R, where the ape package has a function called rcoal() that generate a random coalescence tree. Try this command: plot(rcoal(10)) (first load library by running library(ape)).

Submitted Codes
  • By Weigang (First draft)
#!/usr/bin/env perl
# Basic coalescence
# First, simulate coalescent events with recursion
# Second, use Bio::Tree to export a newick tree
use strict;
use warnings;
use Data::Dumper;
use Algorithm::Numerical::Sample  qw /sample/;
use Math::Random qw(random_exponential);
use Bio::Tree::Tree;
use Bio::Tree::Node;

######################
# Initialize
######################
die "Usage: $0 <num-of-samples>\n" unless @ARGV == 1;
my $nsamp = shift @ARGV;
my @samples;
for (my $i=1; $i<=$nsamp; $i++) { push @samples, {id=>$i, parent=>undef, br=>0} }
my $ctr = $nsamp;
my $time = 0;
my @all_nodes;

################################################################################
# Simulate events with exponential time intervals between two successive events
#################################################################################
&coal(\@samples, \$ctr, \$time);
#print Dumper(\@all_nodes);
#########################################
# Reconstitute into tree using Bio::Tree
##########################################
my %seen_node;
my $max_id=0;
my @nodes;
foreach (@all_nodes) {
    $max_id = ($_->{parent} > $max_id) ? $_->{parent} : $max_id;
    if ($seen_node{$_->{id}}) { # child exist, previously as a parent, no branch length
        $seen_node{$_->{id}}->branch_length(sprintf "%.6f", $_->{br}); # add branch length
        if ($seen_node{$_->{parent}}) { # parent exist
            $seen_node{$_->{parent}}->add_Descendent($seen_node{$_->{id}});
        } else { # parent new
            my $pa = Bio::Tree::Node->new(-id=>$_->{parent});
            $pa->add_Descendent($seen_node{$_->{id}});
            $seen_node{$_->{parent}} = $pa;
        }
    } else { # child new
        my $new = Bio::Tree::Node->new(-id=>$_->{id}, -branch_length => sprintf "%.6f", $_->{br});
        $seen_node{$_->{id}} = $new;
        if ($seen_node{$_->{parent}}) { # parent exist
            $seen_node{$_->{parent}}->add_Descendent($new);
        } else { # parent new
            my $pa = Bio::Tree::Node->new(-id=>$_->{parent});
            $pa->add_Descendent($new);
            $seen_node{$_->{parent}} = $pa;
        }
    }
}

#my $root = $seen_node{$max_id};
my $tree=Bio::Tree::Tree->new(-id=>'coal-sim', -node=>$seen_node{$max_id}, -nodelete=>1);
print $tree->as_text("newick"), "\n";

exit;

sub coal {
    my $ref_nodes = shift;
    my $ref_ct = shift;
    my $ref_time = shift;
    my $ct = $$ref_ct;
    my @current_nodes = @$ref_nodes;
    my $k = scalar @current_nodes;
    my @new_nodes;
    return unless $k > 1;
    my @pair = sample(-set => $ref_nodes, -sample_size => 2);
#    print $ct, "\t", $$ref_time, "\t", $pair[0]->{id}, "\t", $pair[1]->{id}, "\n";
    $$ref_time += random_exponential(1, 2/$k/($k-1));
    my $new_nd = {id=>$ct+1, parent=>undef, br=>$$ref_time};
    map {$_->{parent} = $ct+1} @pair;
    map {$_->{br} = $$ref_time - $_->{br}} @pair;
    push @all_nodes, $_ for @pair;
    foreach (@current_nodes) {
        push @new_nodes, $_ unless $_->{id} == $pair[0]->{id} || $_->{id} == $pair[1]->{id};
    }
    push @new_nodes, $new_nd;
    $$ref_ct++;
    &coal(\@new_nodes, $ref_ct, $ref_time);
}
  • By John
from Bio import Phylo
from io import StringIO
from random import sample
from scipy.misc import comb
from itertools import combinations
from numpy.random import exponential as exp

# Generate Random Tree with Nodes & Waiting Time
individuals = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
node_dict = {}
reference = {}
wait_time_ref = {}
node = 0
while len(individuals) != 1:
    node_dict[node] = {"child_nodes": []}
    sample_events = tuple(sample(individuals, 2))
    reference[sample_events] = node
    wait_time = exp(1/comb(len(individuals), 2))
    wait_time_ref[node] = wait_time
    for sample_event in sample_events:
        if type(sample_event) == str:
            node_dict[node]["child_nodes"].append(sample_event)
        else:
            child_node = reference[sample_event]
            node_dict[node]["child_nodes"].append(child_node)
    individuals.remove(sample_events[0])
    individuals.remove(sample_events[1])
    individuals.append(sample_events)
    node += 1
    
# Calculate Tree Branch Lengths
tree_dict = {}
cumulative_br_length = 0
cumulative_br_len_dict = {}
for i in range(len(node_dict)):
    tree_dict[i] = {}
    cumulative_br_length += wait_time_ref[i]
    cumulative_br_len_dict[i] = cumulative_br_length
    for node in node_dict[i]['child_nodes']:
        if type(node) == str:
            tree_dict[i][node] = cumulative_br_length
        else:
            tree_dict[i][node] = cumulative_br_len_dict[i] - cumulative_br_len_dict[node]
            
# Parse the Tree into a String
for i in range(len(tree_dict)):
    for node in tree_dict[i].keys():
        if type(node) != str:
            temp = str(tree_dict[node])
            tree_dict[i][temp] = tree_dict[i].pop(node)
tree_str = str(tree_dict[i]).replace('\'', '').replace('\"', '').replace('\\', '').replace('{', '(').replace('}', ')').replace(' ', '')
tree_str += ";"            

# Visualize Tree
handle = StringIO(tree_str)
tree = Phylo.read(handle, 'newick')
tree.ladderize()   # Flip branches so deeper clades are displayed at top
Phylo.draw(tree)

April 17, 2017 "Mutation Meltdown" (Due May 5, 2017)

Ratchet-1.png

Our last exploration showed that population will reach a steady-state level of DNA sequence polymorphism under the opposing forces of genetic drift and mutations. The steady-state level is expected to be ~ N * mu: the larger the population size and the higher the mutation rate, the higher per-site DNA polymorphism. (Also note that, given a long enough sequence and a low enough mutation rate, we don't expect to see two or more mutations hitting a single position. Each mutation is essentially a new one and only two-state SNPs are expected in a sample of DNA sequences).

However, the steady-state expectation is based on the assumption that all mutations are neutral (i.e., no beneficial or harmful fitness effects). In reality, mutations are predominantly either neutral or harmful and few are beneficial. Natural selection (negative or positive, except those on the immune-defense loci) tends to drive down genetic variation.

Sex to the rescue. Without sex or recombination, the fate of genetic variations across a genome are bundled together, rising or sinking in unison with the fate of a single beneficial or harmful mutation. With recombination, genetic variations at different loci become less tightly linked and are more likely maintained, speeding up adaptation.

This week, we will use simulation to recreate the so-called "Muller's Ratchet", which predicts that asexual populations are evolutionary dead ends. It shows that the combined effect of deleterious mutation and genetic drift will lead to population extinction due to steady accumulation of deleterious mutations, a phenomenon called "mutation meltdown".

The simulation protocol would be similar to that of the last problem involving only mutation and drift, but I suggest you rewrite from scratch to be more efficient by using a simpler data structure (no sequence or bases needed). The main difference is that, instead of calculating average pairwise sequence differences in each generation, you will track the frequency of mutation-free sequences (n0) and find the time until it goes to zero. That is when the Ratchet makes a "click". The next click is when the population losses the one-mutation sequences (n1), and so on. Each click drives the population fitness down by one mutation-level. [ Numerically, if each mutation causes a fitness loss of s ("selection coefficient"), the fitness of a sequence with k mutations is given by w = (1-s)k. Strictly speaking, the selection coefficient should be included to affect gamete size, but let's ignore that for now].

If there is recombination, the mutation-free sequences could be recovered (simulation next time?). Without recombination, the only direction for the population to evolve is a steady loss of best-fit individuals (by genetic drift).

Questions:

  1. Does the ratchet occur faster in small or large populations? Find by simulating N=100 and N=1000
  2. Does the ratchet occur faster for a short or long genome? Find by simulating L=1e3 and L=1e4
  3. Which parts of the human genomes are asexual (therefore subject to Muller's Ratchet and becoming increasingly small)?
Submitted Codes
  • By John
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import poisson
from numpy.random import choice

def simulator(seq_length, pop, repro_rate, generations):
    mutation_rate = 0.00001
    lam = seq_length * mutation_rate
    individuals = np.array([1 for i in range(pop)])
    non_mutated_pop_rate = []
    for generation in range(generations):
        gametes = []
        for individual in individuals:
            off_sprs = poisson(lam, repro_rate)
            mutation_ix = np.array(np.where(off_sprs != 0))
            indiv_gametes = np.array([individual for j in range(repro_rate)])
            indiv_gametes[mutation_ix] = 0
            gametes += list(indiv_gametes)
        individuals = choice(gametes, size=pop)
        non_mutated_frequency = float(np.count_nonzero(individuals)) / pop
        non_mutated_pop_rate.append(non_mutated_frequency)
    return non_mutated_pop_rate

genome_length_1000 = 1000
genome_length_10000 = 10000
results_1000 = simulator(genome_length_1000, 1000, 100, 500)
results_10000 = simulator(genome_length_10000, 1000, 100, 500)

genome_length_1000 = 1000
genome_length_10000 = 10000
results2_1000 = simulator(genome_length_1000, 10000, 100, 500)
results2_10000 = simulator(genome_length_10000, 10000, 100, 500)

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
line_1_1000 = ax1.plot(results_1000, "r")
line_1_10000 = ax1.plot(results_10000, "b")
line_2_1000 = ax2.plot(results2_1000, "r")
line_2_10000 = ax2.plot(results2_10000, "b")

ax1.set_title("Population = 1000")
ax2.set_title("Population = 10000")

plt.title("Mutation Meltdown")
plt.show()
  • By Weigang
add.mutation <- function(genome, genome.length, mutation.rate) {
  mu.exp <- genome.length * mutation.rate;
  mu.num <- rpois(1, lambda = mu.exp);
  return(genome+mu.num);
}

ratchet <- function(pop.size=100, gamete.size=100, mu=1e-4, genome.length=1e4) {
  out <- data.frame(generation=numeric(), n0=numeric(), n1=numeric(), pop=numeric(), length=numeric());
  pop <- rep(0, pop.size) # initial all mutation-free
  g <- 1; # generation
  freq0 <- 1; # freq of zero-class
  while(freq0 > 0) {
    cat("at generation", g, "\n");
    freq0 <- length(which(pop == 0));
    out <- rbind(out, data.frame(generation=g, n0=freq0/pop.size, pop=pop.size, length=genome.length));
    gametes <- sapply(1:gamete.size, function(x) {lapply(pop, function(x) add.mutation(x, genome.length, mu))});
    pop <- sample(gametes, pop.size);
    g <- g+1;
  }
  return(out);
}

out.long.genome.df <- ratchet(genome.length=1e4);
out.short.genome.df <- ratchet(genome.length=1e3);

March 31, 2017 "Drift-Mutation Balance" (4/14/2017)

Drift-mutation.png

We will explore genetic drift of a DNA fragment with mutation under Wright-Fisher model. From last week's exercise, we conclude that a population will sooner or later lose genetic diversity (becoming fixed after ~2N generations), if no new alleles are generated (by e.g., mutation or migration).

Mutation, in contrast, increases genetic diversity over time. Under neutrality (no natural selection against or for any mutation, e.g., on an intron sequence), the population will reach an equilibrium point when the loss of genetic diversity by drift is cancelled out by increase of genetic diversity by mutation.

You job is to find this equilibrium point by simulation, given a population size (N) and a mutation rate (mu). The expected answer is pi=2N*mu, where pi is a measure of genetic diversity using DNA sequences, which is the average pairwise sequence differences within a population.

Note: the following algorithm seems to be too complex to build at once. Also, too slow to run in R. Nonetheless, please try to write two R functions: (1)mutate.seq(seq, mut), with "seq" as a vector of bases and "mut" is the mutation rate, and (2) avg.seq.diff(pop), with "pop" as a list of DNA sequences

A suggested algorithm is:

  1. Start with a homogeneous population with N=100 identical DNA sequences (e.g., with a length of L=1e4 bases, or about 5 genes) [R hint: dna <- sample(c("a", "t", "c", "g"), size=1e5, replace=T, prob = rep(0.25,4))]
  2. Write a mutation function, which will mutate the DNA based on Poisson process (since mutation is a rare event). For example, if mu=1e-4 per generation per base per individual (too high for real, but faster for results to converge), then each generation the expected number of mutations would be L * mu = 1 per individual per generation for our DNA segment. You would then simulate the random number of mutations by using the R function num.mutations <- rpois(1,lambda=1).
  3. Apply the mutation function for each individual DNA copy (a total of N=100) during gamete production (100 gametes for each individual) at each generation (for a total of G=1000 generations).
  4. Write another function to calculate, for each generation, instead of counting allele frequencies (as last week's problem), to calculate & output average pairwise differences among the 100 individuals.
  5. Finally, you would graph pi over generation.
Submitted Codes
  • By John
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations
from random import sample
from numpy.random import choice
from numpy.random import poisson

# global variables
nuc = np.array(["A", "T", "C", "G"])
expected_mutation_rate = 0.00001 # there are 2 sequences per individual
seq_length = 10000
individual_pop = 100
gamete_rate = 100
generation = 1000

def simulator(population, repro_rate, generation):
    observed_mutation_collection = []
    
    # Original Individuals
    original = choice(nuc, seq_length, 0.25)
    individual_total = [original for i in range(population)]
    
    for i in range(generation): # iterate over 1000 generations
        # Produce Gametes
        gametes = []
        for individual in individual_total:
            gametes += [individual for i in range(repro_rate)]
        gametes = np.array(gametes)
        gametes_pop = gametes.shape[0] # number of gametes: 100 * 100
        
        # Mutation
        mutation_number_arr = poisson(lam=seq_length * expected_mutation_rate, size=gametes_pop) # derive number of mutations per individual
        mutation_index_arr = [sample(range(seq_length), mutations) for mutations in mutation_number_arr] # get the index of mutation
        # Mutation: Replace with Mututated Base Pairs
        for i in range(gametes_pop): # iterate over all gametes
            for ix in mutation_index_arr[i]: # iterate over mutated base pair for each gamete
                gametes[i, ix] = choice(nuc[nuc != gametes[i, ix]], 1)[0] # locate mutation and alter the nuc with mutated one
        
        # Next generation of individuals
        individual_total = gametes[choice(range(10000), 100, replace=False)]
        
        # Calculate observed mutation rate
        num_combinations = 0
        total_diff_bases = 0
        for pair in combinations(range(len(individual_total)), 2): # aggregate mutations for all possible pairs of individuals
            total_diff_bases += len(np.where((individual_total[pair[0]] == individual_total[pair[1]]) == False)[0])
            num_combinations += 1
        observed_mutation_rate = float(total_diff_bases) / float(num_combinations) / seq_length
        observed_mutation_collection.append(observed_mutation_rate)
        
    return(observed_mutation_collection)

# Simulation
result = simulator(100, 100, 1000)

# Visualize results
plt.plot(result, 'b')
plt.xlabel("Generation")
plt.ylabel("Mutation Rate")
plt.title("Drift-Mutation Balance")
plt.show()
  • By Weigang
library(Biostrings) # a more compressed way to store and manipulate sequences
bases <- DNA_ALPHABET[1:4];
dna <- sample(bases, size = 1e5, replace = T);
library(ape) # to use the DNAbin methods
# function to mutate (Poisson process)
mutate.seq <- function(seq, mutation.rate) {
  mu.exp <- length(seq) * mutation.rate;
  mu.num <- rpois(1, lambda = mu.exp);
  if (mu.num > 0) {
    pos <- sample(1:length(seq), size = mu.num);
    for (j in 1:length(pos)) {
      current.base <- seq[pos[j]];
      mutated.base <-sample(bases[bases !=current.base)], size = 1);
      seq[pos[j]] <- mutated.base;
    }
  }
  return(seq)
}

# main routine
drift.mutation <- function(pop.size=100, generation.time=500, gamete.size=10, mu=1e-5) {
  out <- data.frame(generation=numeric(), pi=numeric()); # for storing outputs
  pop <- lapply(1:pop.size, function(x) dna)  # create the initial population
  for(i in 1:generation.time) { # for each generation
    cat("at generation", i, "\n"); # print progress
    gametes <- sapply(1:gamete.size, function(x) {lapply(pop, function(x) mutate.seq(x, mu))}); 
    pop <- sample(gametes, pop.size); 
    pop.bin <- as.DNAbin(pop); # change into DNAbin class to advantage of its dist.dna() function
    out <- rbind(out, data.frame(generation=i, pi=mean(dist.dna(pop.bin))));
  }
  out;
}
out.df <- drift.mutation(); # run function and save results into a data frame

March 18, 2017 "Genetic Drift" (Due 3/31/2017)

Drift.png
Drift-founder.png

This is our first biological simulation. Mandatory assignment for all lab members (from interns to doctoral students). An expected result is shown in the graph.

Task: Simulate the Wright-Fisher model of genetic drift as follows:

  1. Begin with an allele frequency p=0.5, and a pop of N=100 haploid individuals [Hint: pop <- c(rep("A", 50), rep("a", 50))]
  2. Each individual produces 100 gametes, giving a total of 10,000 gametes [Hint: use a for loop with rep() function]
  3. Sample from the gamete pool another 100 to give rise to a new generation of individuals [Hint: use sample() function]
  4. Calculate allele frequency [Hint: use table() function]
  5. Repeat the above in succession for a total generation of g=1000 generations [Hint: create a function with three arguments, e.g., wright.fisher(pop.size, gamete.size, generation.time)]
  6. Plot allele frequency changes over generation time
  7. Be prepared to answer these questions:
    1. Why allele frequency fluctuate even without natural selection?
    2. What's the final fate of population, one allele left, or two alleles coexist indefinitely?
    3. Which population can maintain genetic polymorphism (with two alleles) longer?
    4. Which population gets fixed (only one allele remains) quicker?
Submitted Codes
  • By Lili (May, 2019)
#Simple sampling, haploid, as a function 
wright_fisher <- function(pop_size, gam_size, alle_frq, n_gen) {
  pop <- c(rep("A", pop_size*frq), rep("a", pop_size*frq))
  prob <- numeric(n_gen)
  for(time in 1:n_gen){
    gamt <- rep(pop, gam_size)
    pop <- sample(gamt, pop_size)
    prob[time] <- table(pop)[1]/pop_size
  }
  wf_df <- data.frame(generation=1:n_gen, probability=prob)
  plot(drift_df, type= 'l', main = "Wright-Fisher Process (Genetic Drift)", xlab = "generation", ylab = "allele frequency")
}

wright_fisher(1000, 100, 0.5, 1000)

#Version 2. Using binomial distribution with replicates, diploid

pop_size <- c(50, 100, 1000, 5000)
alle_frq <- c(0.01, 0.1, 0.5, 0.8)
n_gen <- 100
n_reps <- 50
genetic_drift <- data.frame()

for(N in pop_size){
  for(p in alle_frq){
    p0 <- p
    for(j in 1:n_gen){
      X <- rbinom(n_reps, 2*N, p)
      p <- X/(2*N)
      rows <- data.frame(replicate= 1:n_reps, pop=rep(N, n_reps), gen=rep(j, n_reps), frq=rep(p0, n_reps), prob=p )
      genetic_drift <- rbind(genetic_drift, rows)
    }
  }
}

library(ggplot2)
ggplot(genetic_drift, aes(x=gen, y=prob, group=replicate)) + geom_path(alpha= .5) + facet_grid(pop ~ frq) + guides(colour=FALSE)

# 3rd version: trace ancestry (instead of allele frequency)
  • By John
#Python
import numpy as np
import sys
from random import sample
import matplotlib.pyplot as plt
 
def simulator(gametes_rate, next_individuals, generations):
    individuals = [1 for i in range(50)] + [0 for j in range(50)]
    frequency = []
    for generation in range(generations):
        gametes = []
        for individual in individuals:
            gametes += [individual for i in range(gametes_rate)]
        individuals = sample(gametes, next_individuals)
        frequency.append(np.count_nonzero(individuals) / len(individuals))
    return(frequency)
 
N_100 = simulator(100, 100, 1000)
N_1000 = simulator(100, 1000, 1000)
 
# Create plots with pre-defined labels.
# Alternatively,pass labels explicitly when calling `legend`.
fig, ax = plt.subplots()
ax.plot(N_100, 'r', label='N=100')
ax.plot(N_1000, 'b', label='N=1000')
 
# Add x, y labels and title
plt.ylim(-0.1, 1.1)
plt.xlabel("Generation")
plt.ylabel("Frequency")
plt.title("Wright-Fisher Model")
 
# Now add the legend with some customizations.
legend = ax.legend(loc='upper right', shadow=True)
 
# The frame is matplotlib.patches.Rectangle instance surrounding the legend.
frame = legend.get_frame()
frame.set_facecolor('0.90')
 
# Set the fontsize
for label in legend.get_texts():
    label.set_fontsize('large')
 
for label in legend.get_lines():
    label.set_linewidth(1.5)  # the legend line width
plt.show()
#Scala
import scala.util.Random
import scala.collection.mutable.ListBuffer
 
val gamete_rate = 100
val offspr_rate = 100
val generations = 1000
 
var frequency: List[Double] = List()
 
var individuals = List.fill(50)(0) ++ List.fill(50)(1)
 
for(generation <- 1 to generations){
  var gametes = individuals.map(x => List.fill(gamete_rate)(x)).flatten
  individuals = Random.shuffle(gametes).take(offspr_rate)
  frequency = frequency :+ individuals.count(_ == 1).toDouble / offspr_rate
}
 
print(frequency)
#Spark
val gamete_rate = 100
val offspr_rate = 100
val generations = 300
 
var frequency: List[Double] = List()
var individuals = sc.parallelize(List.fill(50)("0") ++ List.fill(50)("1"))
 
for(generation <- 1 to generations){
  val gametes = individuals.flatMap(x => (x * gamete_rate).split("").tail)
  individuals = sc.parallelize(gametes.takeSample(false, offspr_rate))
  val count = individuals.countByValue
  frequency = frequency :+ count("1").toDouble / offspr_rate
}
 
print(frequency)
  • By Brian
Wright<-function(pop.size,gam.size,generation) {
 if( pop.size%%2==0) {
 
  pop<-c(rep("A",pop.size*.5),rep("a",pop.size))
  frequency<-.5
  time<-0
  geneticDrift<-data.frame(frequency,time)
  for (i in 1:generation) {
    largePop<-rep(pop,gam.size)
    samplePop<-sample(largePop,100)
    cases<-table(samplePop)
    if (cases[1]<100 && cases[2]<100 ) 
        {propA<-(cases[1]/100)
        geneticDrift<-rbind(geneticDrift,c(propA,i))
        pop<-c(rep("A",cases[2]),rep("a",cases[1]))
  }  
}
  plot(geneticDrift$frequency~geneticDrift$time,type="b", main="Genetic Drift N=1000", xlab="time",ylab="Proportion Pop A",col="red", pch=18 )
}
 if(pop.size%%2==1 ) {
   print("Initial Population should be even. Try again")
 }
}
Wright(1000,100,1000)
  • By Jamila
Genetic_code = function(t,R){ 
  N<- 100
  p<- 0.5
  frequency<-as.numeric();
  for (i in 1:t){
    A1=rbiom(1,2*N,p)
    p=A1/(N*2); 
    frequency[length(frequency)+1]<-p; 
  }
  plot(frequency, type="1",ylim=c(0,1),col=3,xlab="t",ylab=expression(p(A[1])))
  
  }
  • By Sharon
p=0.5
N=100
g=0
pop <- c(rep("A", 50), rep("a", 50)) #create a population of 2 alleles
data<-data.frame(g,p) #a set of variables of the same # of rows 
for (i in 1:1000) { #generation 
  gam_pl <-rep(pop, 100) #gamete pool
  gam_sam <-sample(gam_pl, 100)  #sample from gamete pool
  tab_all<-table(gam_sam) #table ps sample
  all_freq<-tab_all[1]/100 #get allele frequency
  data<-rbind(data, c(i,all_freq))
  pop<-c(rep("A",tab_all[1]))
}
plot(data$g,data$p)
  • By Sipa
pop <- c(rep("A", 50), rep("a", 50))
wright_fisher <- function(N,gen_time,gam) {
  N=N
  gen_time = gen_time
  x = numeric(gen_time) 
  x[1] = gam
  for (i in 2:1000) {
  k=(x[i-1])/N
  n=seq(0,N,1)
  prob=dbinom(n,N,k)
  x[i]=sample(0:N, 1, prob=prob)
  
}

plot(x[1:gen_time], type="l", pch=10)

}
pool <- rep(pop, times = 100)
s_pool <- sample(pool, size = 100, replace = F)
table(s_pool)
wright_fisher2 <- function(pop_size,al_freq,gen_time) {
  pop <- c(rep("A", pop_size/2), rep("a", pop_size/2))
  pool <-rep(pop, times=100)
  s_pool <- sample(pool, size = 100, replace = T)
  for(i in 1:gen_time)
    s_pool <- sample(sample(pool, size = pop_size, replace = F))
  a.f <- table(s_pool)[1]/100
  return(a.f)
  }
  • By Nicolette
N=100 
p=0.5 
g=1000 
sim=100 
Genetic_drift=array(0, dim=c(g,sim))
Genetic_drift[1,]=rep(N*p,sim)
for(i in 1:sim) {
  for(j in 2:g){
    X[j,i]=rbinom(1,N,prob=X[j-1,i]/N)
  }
}
Genetic_drift=data.frame(X/N)
matplot(1:1000, (X/N), type="l",ylab="allele_frequency",xlab="generations")
  • By Weigang
# output frequency:
wright.fisher <- function(pop.size, generation.time=1000, gametes.per.ind=100) {
  out.df <- data.frame(gen=numeric(), freq=numeric(), pop.size=numeric());
  pop <- c(rep("A", pop.size/2), rep("a", pop.size/2)); # initial pop
  for(i in 1:generation.time) {
    gametes <- unlist(lapply(pop, function(x) rep(x, gametes.per.ind)));
    pop <- sample(gametes, pop.size); # sample gamete pool without replacement
    freq.a <- table(pop)[1]/pop.size; # frequency of "a"
    out.df <- rbind(out.df, c(gen=i, freq=freq.a, pop=pop.size, rep=rep));
  }
  out.list;
}

out.1e2<-wright.fisher(100)
out.1e3<-wright.fisher(1000)
# Make Figure 1 above
plot(out.1e2[,1], out.1e2[,2], type = "l", las=1, ylim=c(0,1), xlab="generation", ylab="allele frequency", main="Wright-Fisher Process (Genetic Drift)", col=3, lwd=2)
lines(out.1e3[,1], out.1e3[,2], type = "l", col=2, lwd=2)
legend(400, 0.2, c("N=100", "N=1,000"), lty=1, col=3:2, cex=0.75)
abline(h=0.5, col="gray", lty=2)

# Second function to track founders (not frequency)
wright.fisher.2 <- function(pop.size, generation.time=1000, gametes.per.ind=100) {
  pop <- 1:100 # label founders
  out.list <- list(pop);
  for(i in 1:generation.time) {
    gametes <- unlist(lapply(pop, function(x) rep(x, gametes.per.ind)));
    pop <- sample(gametes, pop.size); # sample gamete pool without replacement
    out.list[[i+1]] <- pop;
  }
  out.list;
}

out.1ist<-wright.fisher.2(100)
# plot loss of genetic diversity by tracking surviving founder lineages
plot(x=1:100, y=rep(1, 100), ylim=c(0,100), xlim=c(1,600), type="n", las=1, main="Drift: Suvival of founder lineages\n(N=100)", cex.main=0.75, xlab="generation", ylab="founders")
for (i in 1:600) { points(rep(i,100), 1:100, col=colors()[out.list[[i]]], cex=0.5) }
for (i in 1:600) {points(rep(i,100), out.list[[i]], cex=0.2, col="yellow") }

March 11, 2017 "Stoplights" (Due 3/18/2017)

Stoplight-1.png
  • Source: Paul Nahin (2008). "Digital Dice", Problem 18.
  • Challenge: How many red lights, on average, will you have to wait for on your journey from a city block m streets and n avenues away from Belfer [with coordinates (1,1)]? (assuming equal probability for red and green lights)
  • Note that one has only wait for green light when walking along either the north side of 69 Street or east side of 1st Avenue. On all other intersections, one can walk non-stop without waiting for green light by crossing in the other direction if a red light is on.
  • Formulate your solution with the following steps:
  1. Start from (m+1,n+1) corner and end at (1,1) corner
  2. The average number of red lights for m=n=0 is zero
  3. Find the average number of red lights for m=n=1 by simulating the walk 100 times
  4. Increment m & n by 1 (but keep m=n), until m=n=1000
  5. Plot average number of red lights by m (or n).
Submitted Codes
  • By John
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame
from random import sample
from itertools import combinations, permutations
from numpy import count_nonzero, array
 
def red_light(av, st):
    traffic_light = ["red_st", "red_av"]
    total_cross = av + st - 1
    while av != 0 and st != 0:
        if sample(traffic_light, 1)[0] == "red_st":
            av -= 1
        else:
            st -= 1
    rest_cross = av if av != 0 else st
    return(count_nonzero([sample([0, 1], 1) for corss in range(rest_cross)]))
 
df = DataFrame(0, index=np.arange(10), columns=np.arange(10))
 
simulation = 1000
for av in df.index:
    for st in df.index:
        df.loc[av, st] = sum(array([[red_light(av, st)] for n in range(simulation)])) / simulation
        
plt.imshow(df, cmap='hot', interpolation='nearest')
plt.xticks(np.arange(1, 10))
plt.yticks(np.arange(1, 10))
plt.title("Average Numbers Waiting for Red Light")
plt.xlabel("Number of Av")
plt.ylabel("Number of St")
plt.show()
 
df
 
# Recursive Function for Each Trial

from sys import stdout
av = 30
st = 30
n_space = [0]
traffic_light = ["red_st", "red_av"]
def route(av, st):
    if not av == st == 0:
        if sample(traffic_light, 1)[0] == "red_st":
            if av == 0:
                stdout.write("w_")
                n_space[0] += 1
                route(av, st - 1)
            else:
                stdout.write("\n" + " " * n_space[0] + "|")
                route(av - 1, st)
        else:
            if st == 0:
                stdout.write("\n" + " " * n_space[0] + "w" + "\n" + " " * n_space[0] + "|")
                route(av - 1, st)
            else:
                stdout.write("_")
                n_space[0] += 1
                route(av, st - 1)
route(av, st)
  • By Weigang
#!/usr/bin/env perl
use strict;
use warnings;

# Use recursion
my ($distx, $disty) = @ARGV;
my $num_red = 0; # keep red-light counts
print "-" x $distx, "\n"; # print a starting line
&walk($distx, $disty, \$num_red); # pass reference not value
exit;

sub walk {
    my ($x, $y, $ref) = @_;
    my $ct = $$ref;
    my $prob_red;
    if ($x == 1 && $y == 1) { # Reached destination
	print "*\n";
	print "-" x $distx, "\n"; # print the ending line
	print "reached Belfer after waiting for ", $ct, " red lights\n";
    }

    if ($x == 1 && $y > 1) { # Reached right-side end
	$prob_red = rand();
	if ($prob_red < 0.5) { # red light, wait
	    $ct++;
	    print "w\n", " " x ($distx-1);
	} else {
	    print "|\n", " " x ($distx-1);
	}
	&walk($x, $y-1, \$ct);
    }

    if ($x > 1 && $y == 1) { # Reached bottom
	$prob_red = rand();
	if ($prob_red < 0.5) { # red light, wait
	    $ct++;
	    print "w";
	} else { print "-"}
	&walk($x-1, $y, \$ct);
    }

    if ($x > 1 && $y > 1) {
	my $prob_across = rand(); # prob of walking right with green light
	if ($prob_across >= 0.5) { # move one block right
	    print "-";
	    &walk($x-1, $y, \$ct);
	} else { # red light, move one block down
	    print "|\n", " " x ($distx-$x);
	    &walk($x, $y-1, \$ct);
	}
    }
}
  • By Jeff
# In[26]:
def walk (walker):
    import random
    
    def just_walk (x):                                  #[1]
        if random.choice(["red_m_direction", "green_m"])=="green_m": x["loca"]["m"] -=1
        else: x["loca"]["n"] -=1
        return x    
    def may_have_to_wait (dist,waited):                 #[2]
        if random.choice(["green","red"])=="red": waited += 1
        else: dist -= 1
        return (dist,waited)    

    while (0 not in walker["loca"].values()):         #start walking
        walker = just_walk(walker) 
    while (walker["loca"]["m"] !=0):       # if n =0 and m != 0
        (walker["loca"]["m"], walker["waited"]) = may_have_to_wait(walker["loca"]["m"],walker["waited"]) 
    while (walker["loca"]["n"] !=0):       # if m =0 and n != 0
        (walker["loca"]["n"], walker["waited"]) = may_have_to_wait(walker["loca"]["n"],walker["waited"]) 

    return walker["waited"]

def given_distance(m,n,rep): 
    waited = list()
    for x in range(rep):
        walker={"loca":{"m":m,"n":n}, "waited":0}     #walker variable
        waited.append(walk(walker))      
    return sum(waited)/len(waited)
# main
result=list()
for d in range(1000):                                  # set walking distance here
        result.append(given_distance (d,d,100))      # set rep here
import matplotlib.pyplot as plt                     # ploting 
plt.plot(result); plt.show()  

#[1] no reason to wait for ANY red before one of the distances (m or n) is exhausted. Assuming when light is green for m, it must be red for n, vise versa. 
#[2] when one of the directions (m or n = 0), waiting cannot be avoided, start counting.

March 3, 2017 "PiE" (Due 3/10/2017)

  • Source: Paul Nahin (2008). "Dueling Idiots", Problem 5.
  • Challenge: obtain numerical values of Pi and E by simulations
  • Simulate pi by
    Pi-sim.png
  1. Randomly generate 10,000 pairs of uniformly distributed numbers from 0 and 1 (simulating throwing darts onto the unit square shown at right)
  2. Count the number of points enclosed within the quarter-circle
  3. Calculate the pi value from this proportion
  • Simulate e by
  1. Generate N random numbers from 0 to 1
  2. Divide into N equal-width bins between 0 and 1
  3. Count the number of bins Z that receive none of the random numbers
  4. Obtain e ~ N/Z (based on binomial sampling formula)
  5. Simulate N=1e2, 1e3, and 1e4
Submitted Codes
  • By Weigang
# simulate pi:
darts <- sapply(1:1000, function(x) { coords<- runif(2); return(ifelse(coords[1]^2+coords[2]^2 <= 1, 1,0)) })
pi <- 4*mean(darts)

# simulate e
N <- 1e3;
n <- runif(N);
cts <- numeric();
for(i in 1:(N-1)) {
  left <- i/N;
  right <- (i+1)/N;
  cts[i] <- length(which(n>=left & n < right))
}
e <- N/length(which(cts==0))
  • By Nicolette
#Simulate pi
random <- 0; square <- 0;
for (i in 1:1000){
  random[[i]]<- runif(1,0,1)
square[[i]]<- sqrt(1-(random[i])^2)
}
plot(random, square)
areaofqc <- (pi/4)
ranleqc <- length(which(random <=areaofqc))
squleqc <- length(which(square<=areaofqc))

#Simulate e
e <- 0;
for(N in 1:1000) {
  numbers<-runif(N)
bin.size<-1/N
  non.empty<-as.integer(numbers/bin.size)
  z.empt<- N - length(table(non.empty))
  e<-c(e, N/z.empt)
  }
plot(e)
  • By Brian
#When you throw a dart at the a unit square dartboard the Probability of hitting the portion of the circle radius with center=(0,0) inside the unit square is  pi/4.  We can say that hitting the 1/4 circle within the unit square with a dart is a bernoulli random variable where p=pi/4. Further, we can say E(bernoulli=hit)=pi/4. 
# Imagine you are scoring the game of darts as follows- 1 point if you throw in the 1/4 circle and 0 points if you miss this space. 
# If you throw 10000 darts you just did 10000 iid bernoulli trials. By the law of large numbers if we count the number of hits to the 1/4 circle of radius 1 and divide by number of darts thrown we will get something pretty close to E(beroulli) . Multiply that number by 4 and you have an estimate of Pi.  

## Generate 10,000 pairs of points in the unit square with uniform distribution

x<-c(runif(1000000, min = 0, max = 1))
y<-c(runif(1000000, min = 0, max = 1))
point<-data.frame(x,y)

plot(point$x,point$y)
point.sub<-subset(point,y<=sqrt(1-x^2))

plot(point.sub$x,point.sub$y)

z<-4*nrow(point.sub)/1000000
z
error<-pi-z
error

## simulating exp
#It's a well that a binomial with large n and tiny p is a good estimate of the poisson random For this estimate lambda=np.  In this case n=10,000 and p=the probability of falling into a particular unit=1/10000....... 
 simexp<-function (n) {
a<-c(runif(n,min=0,max=1))
b<-c()

for (i in 1:n) {
  c<-subset(a,(i-1)/n<a & a<=i/n)
  d<-length(c)
  b<-append(b,d)
}
bzero<-subset(b,b==0)
length(bzero)
print(n/length(bzero))
}
simexp(1000)
simexp(10000)
simexp(100000)
  • By John
#python code
# Simulate pi
simulation = 10000
distances = np.array([pow(uniform(0, 1), 2) + pow(uniform(0, 1), 2) for i in range(simulation)])
pi = count_nonzero(distances < 1) / simulation * 4
pi

# Simulate e
total = 10000
ranges = [[x/total, (x+1)/total] for x in range(total)]
sample_space = [uniform(0, 1) for x in range(total)]
index = []
for i in range(total):
    if any(ranges[i][0] <= point <= ranges[i][1] for point in sample_space):
        index.append(i)
e = total / (total - len(set(index)))
e

Feb 24, 2017 "Idiots" (Due 3/3/2017)

Idiots.png
  • Source: Paul Nahin (2000), "Dueling Idiots". Problem #2: "When Idiots Duel"
  • Game: Idiots A and B decide to duel with a gun in the following way: They will insert a single bullet into the gun's cylinder. The cylinder has a total of 6 slots. Idiot A will spin the cylinder and shoot at B. If the gun doesn't fire, then A will give the gun to B, who will spin the cylinder and then shoot at A. This back-and-forth duel will continue until one fool shoots (and kills) the other.
  • Questions: (1) what is the probability that A will win (and B dies); (2) What is the average number of trigger pulls before someone dies?
Submitted Codes
  • By Roy
gun <-c(1,0,0,0,0,0)
gun2 <- c(1,0,0,0,0,0)
deadman <-0 ;idiot1win <-0; idiot2win <- 0; nooned <- 0; total <- 0
 while(deadman < 1000){
  idiot1shot <-sample(gun)
  if (length(which(idiot1shot[1] == 1))){
    deadman <- deadman + 1
    idiot1win <- idiot1win + 1
  }else{
      idiot2shot <-sample(gun2)
      if (length(which(idiot2shot[1] == 1))){
        deadman <- deadman +1
        idiot2win <- idiot2win +1
      } else {nooned <- nooned + 1}
  }
  total <- total + 1
  }


deadman
idiot1win
idiot2win
nooned
total

p <- idiot1win/1000
p*100
takes2kill <- deadman/total
takes2kill*100
idiot1shot
idiot2shot
  • By Weigang
rounds <- sapply(1:1000, function(x) {
  alive <- 1;
  round <- 0;
  while(alive == 1){
    spin <- sample(c(0,0,0,0,0,1));
    round <- round + 1;
    if(spin[1] == 1) { alive <- 0 }
  }
  return(round)
})
prob.a.live <- length(which(rounds %% 2 == 0))
barplot(table(rounds)/1000, xlab="num rounds", ylab="Prob", main = "Sudden Death with a 6-slot gun (sim=1000)", las=1)

Feb 17, 2017 "Birthday" (Due 2/24/2017)

B-day.png
  • Problem: What is the probability NONE of the N people in a room sharing a birthday?
  1. Randomly select N individuals and record their B-days
  2. Count the B-days NOT shared by ANY two individuals
  3. Repeat (for each N) 1000 times, and obtain probability by averaging the previous counts (i.e., divided by 1000)
  4. Vary N from 10 to 100, increment by 10
  5. Plot probability of no-shared B-Day (Y-axis) versus N (x-axis), with either a stripchart or boxplot, or both
Submitted Codes
  • By Roy
N <- 0
samples <- 0;
prob.nodub <-0;
for(j in 10:100){
  counting.no.dups <- 0;
  test <- for(i in 1:1000){
  bdays <- sample(seq(as.Date('1990/01/01'), as.Date('1990/12/31'), by="day"), N, replace=T)
  dups <- duplicated(bdays, incomparables = FALSE)
  ch <-length(which(dups == TRUE))
    if(ch==0){
      counting.no.dups <- counting.no.dups +1
    }
    fine <- (counting.no.dups/1000)
  }
  print(N)
  print(fine)
  N <- N + 1
  samples[[j]] <- N
  prob.nodub[[j]] <- fine
}
plot(samples,prob.nodub, main="Birthday Simulation", xlab = "Sample Size", ylab = "Probability of no Duplicates", las =1)
  • By weigang
days <- 1:365;

find.overlap <- function(x) { return(length(which(table(x)>1))) }

output <- sapply(1:100, function(x) { # num of people in the room
  ct.no.overlap <- 0;
  for (k in 1:100) {
    bdays <- sample(days, x, replace = T);
    ct <- find.overlap(bdays);
    if (!ct) { ct.no.overlap <- ct.no.overlap + 1}
  }
  return(ct.no.overlap);
})
 
plot(1:100, output/100, xlab="Group size", ylab = "Prob (no shared b-day)", las=1, main="B-day (sim=100 times)", type="l")
abline(h=0.5, lwd=2, col=2, lty=2)
abline(v=seq(0, 100, 5), col="gray")

Feb 10, 2017 "Dating" (Valentine's Day Special; Due 2/17/2017)

Dating.png
  • Source: Paul Nahin (2008), "Digital Dice". Problem #20: "An Optimal Stopping Problem"
  • Problem: What is the optimal time point when one should stop dating more people and settle on a mate choice (and live with the decision)
  • Your best strategy is to date an initial sample of N individuals, rejecting all, and marry the next one ranked higher than any of your N individuals. The question is what is the optimal number for N.
  1. The problem could be investigated by simulating a pool of 10 individuals, ranked from 1-10 (most desirable being 1) and then take a sample of N
  2. You may only date one individual at a time
  3. You cannot go back to reach previously rejected candidates
  4. Simulate N from 0 to 9 (zero means marrying the first date, a sample size of zero)
  5. For each N, obtain the probability of finding the perfect mate (i.e., ranked 1st) by running simulation 1000 times
  6. Plot barplot of probability versus sample size N.
  7. Expected answer: N=4
Submitted Codes
  • By Weigang
pick.candidate <- function(min, array) {
  for (i in 1:length(array)) {
    if (array[i] < min) {
      return(array[i])
    } else {next}
  }
  return(0) # No 1. has been sampled and rejected
}
candidates <- 1:10;
output <- sapply(0:9, function(x) {
  ct <- 0;
  for(k in 1:1000) {
    if (x==0) { # no sample, marry the 1st guy 
      sampled <- sample(candidates, 1);
      if (sampled == 1) {ct <- ct+1}
    } else {
      sampled <- sample(candidates, x);
      not.sampled <- candidates[-sampled];
      not.sampled <- sample(not.sampled);
      if (pick.candidate(min =  min(sampled), array = not.sampled) == 1) {ct <- ct+1}
    }
  }
  return(ct);
})
barplot(output/1e3, names.arg = 0:9, xlab = "number of sampled dates", las=1, main = "Optimal stopping for dating (N=10 candidates)", ylab = "Prob(marrying No.1)")

Feb 3, 2017 "US Presidents" (Due 2/10/2017)

Sim-presidents.png
  • Download : 1st column is the order, 2nd column is the name, the 3rd column is the year of inauguration; tab-separated
  • Your job is to create an R, Perl, or Python script called “us-presidents”, which will
  1. Read the table
  2. Store the original/correct order
  3. Shuffle/permute the rows and record the new order
  4. Count the number of matching orders
  5. Repeat Steps 3-4 for a 1000 times
  6. Plot histogram or barplot (better) to show distribution of matching counts
  7. Hint: For R, use the sample() function. For Perl, use the rand() function.
Submitted Codes
  • By Mei
pres.list <- lapply(1:1000, function(x) pres[sample(nrow(pres)),])
cts <- sapply(pres.list, function(x) {
  ct.match <- 0;
  for (i in 1:45){
    if (pres$order[i] == x[i,1]){
      #cat(as.character(x[i,2]), x[i,1], "\n")  #as.character to avoid the factor info
      ct.match <- ct.match + 1;
      #cat(ct.match)
    }
  }
  ct.match #return
})
barplot(table(cts),xlab = "Number of matches per shuffle", border = "hotpink", col = "pink", ylab = paste("Frequency total of:",length(pres.list)), main = "US Presidents")
  • By John
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt

df = pd.read_table("presidents.txt", names=["num", "name", "presidency"])

name_list = list(df.name)

# create a list to store matches after each shuffle
shuffle_record = []

# create a function in which the first argument is the original dataset; second argument is number of shuffles
def shuffler2(original, n):
    record = []
    for i in range(n):
        num = 0
        each_shuffle = {}
        temp = original.reindex(np.random.permutation(original.index)) # do shuffling for each
        compare = original.num == temp.num
        matched_df = original.ix[compare[compare == True].index]
        for i in matched_df.index:
            each_shuffle[i] = matched_df.name[i]
        shuffle_record.append(each_shuffle)
        try:
            num = compare.value_counts()[1]
        except:
            pass
        record.append(num)
    return(record)
result2 = shuffler2(df, 1000)

plt.hist(result2, color="yellow")
plt.title("Histgram for President Data")
plt.show()
  • By Weigang
p <- read.table("Presidents.txt", sep="\t", header=F)
colnames(p) <- c("order", "name", "inaug.year")
# Use "sapply" or "lapply" for loops: no need to pre-define a vector to store results
p.sim <- sapply(1:10000, function (x) {
  length(which(sample(p$order) == p$order))
  }) 
barplot(table(p.sim)/1e4, las=1)
p.exp <- rpois(1e4, 1) # draw 10000 Poisson random deviates
mp <- barplot(table(p.exp)/1e4, las=1, xlab = "Num of matching presidents") # mid-point on x-axis
lines(mp[1:7], table(p.sim)/1e4, type="b", col=2) # add a line (Poisson-expected) to the barplot (simulated)
legend("topright", c("Simulated", "Poisson expectation"), col=1:2, lty=1)