Bioinformatics Workshop 2014: Difference between revisions

Revision as of 06:22, 28 July 2014

Summer Bioinformatics Workshop (BIOL 470.83/790.86, Summer II 2014)
Instructors: Drs. Konstantinos Krampis & Weigang Qiu, Levy Vargas
Room: 1001B HN (10th Floor, North Building)
Hours: Tues & Thur, 11:30 am-3:00 pm
Office Hours: Room 830 HN; Tuesday 3-5 pm or by appointment
Contacts: Konstantinos Krampis <python4bio at gmail.com>; Levy Vargas <levy.vargas at gmail.com>

Course Description

Background

Biomedical research is becoming a high-throughput science. As a result, information technology plays an increasingly important role in biomedical discovery. Bioinformatics is a new interdisciplinary field formed by the merging of molecular biology and computer science techniques. Today's biology students must therefore learn not only in vivo and in vitro research skills, but also in silico ones. Quantitative/computational biologists are expected to be in increasing demand in the 21st century.

However, the technical barrier to entering the field and performing basic research projects in a bioinformatics lab is daunting for most undergraduate students. This is mainly due to the multidisciplinary nature of quantitative biology, which requires knowledge and skills in chemistry, biology, computer programming, and statistics. The Hunter Summer Bioinformatics Workshop aims to introduce bioinformatics to motivated undergraduate and high school students by lowering the barrier and dispensing with the usual prerequisites of advanced biology/chemistry courses and entry-level programming/statistics courses. The Workshop does not assume prior programming experience.

The workshop DOES NOT

Replace existing advanced bioinformatics courses such as BIOL425 and STAT 319
Teach advanced bioinformatics programming skills (e.g., advanced data structures, object-oriented Perl, BioPerl, or relational databases with SQL), which are the contents of BIOL425
Teach in-depth statistics or the popular R statistical package, although probabilistic thinking (e.g., distributions of a random variable, stochastic processes, likelihood, clustering analysis) is at the core of all bioinformatics analysis (STAT 319 teaches these topics)

To learn these advanced bioinformatics topics and skills, motivated students are encouraged to enroll in one of the five bioinformatics concentrations at Hunter. The QuBi program prepares students for bioinformatics positions in a research lab or a biotechnology company.

This course will introduce both bioinformatics theory and practice. Topics include database searching, sequence alignment, and basic molecular phylogenetics. The course is held in a UNIX-based instructional lab specifically configured for bioinformatics applications. Each session consists of a first half of instruction on bioinformatics theory and a second half of hands-on exercises.

Learning Goals

Students are expected to be able to:

Retrieve and analyze DNA and protein sequences using online databases
Write simple computer programs to manipulate DNA sequences
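
As an illustration of the second goal, here is a minimal sketch of two simple DNA manipulations written for the bash command line. It is not taken from the course slides: the function names revcomp and gc_count and the example sequence are hypothetical, and only standard tools (awk, tr, wc) are assumed.

```shell
# Reverse-complement a DNA string: reverse it, then swap A<->T and C<->G.
# (revcomp is a hypothetical helper name, not part of the course material.)
revcomp() {
    printf '%s\n' "$1" |
        awk '{ for (i = length($0); i > 0; i--) printf "%s", substr($0, i, 1); print "" }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

# Count the G and C bases in a DNA string by deleting every other character.
gc_count() {
    printf '%s' "$1" | tr -cd 'GCgc' | wc -c | tr -d ' '
}

seq="ATGCGT"        # hypothetical example sequence
revcomp "$seq"      # prints ACGCAT
gc_count "$seq"     # prints 3
```

The same operations are natural first exercises in Perl once strings and loops have been covered in the workshops.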

Textbook

No textbook is required; handouts will be provided in class.

Grading & Academic Honesty

Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity and will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures.

Student performance will be evaluated by weekly assignments and projects. While these are take-home projects and students are allowed to work in groups, students are expected to compose the final short answers, computer commands, and code independently. There is a virtually unlimited number of ways to solve a computational problem, and just as many ways and personal styles to implement an algorithm. Writing and blocks of code that are virtually exact copies between individual students will be investigated as possible cases of plagiarism (e.g., copies from the Internet, a textbook, or each other). In such a case, the instructor will hold closed-door exams for the individuals involved. Zero credit will be given to ALL involved individuals if the instructor considers that there is enough evidence of plagiarism. To avoid being investigated for plagiarism, Do Not Copy from Others & Do Not Let Others Copy Your Work.

The grading scheme for the course is as follows (subject to change; you will be notified with sufficient time):

In-Class Assignments: 8 exercises, 20 points each (Attendance is mandatory)
Weekly assignments: 4 exercises, 10 points each
Mid-term: 50 points, on July 24
Final exam: 50 points, on August 14

Programming Assignment Expectations

All code must begin with the lines in the Perl slides, without exception. For each assignment, unless otherwise stated, I would like the full text of the source code. Since you cannot print using the text editor in the lab (even if you are connected from home), you must copy and paste the code into a word processor or a local text editor. If you are using a word processor, change the font to a fixed-width/monospace font. On Windows, this is usually Courier.

Code indentation is a matter of personal taste, so long as it is consistent and readable. Use comments wherever you think the code is unclear, or simply as a guideline for yourself. Well-commented code improves readability, but be careful not to overdo it.

Also, unless otherwise stated, both the input and the output of the program must be submitted as well. This should also be in fixed-width font, and you should label it in such a way so that I know it is the program's input/output. This is so that I know that you've run the program, what data you have used, and what the program produced.

If you are working from the lab, one option is to email the code to yourself, change the font, and then print it somewhere else as there is no printer in the lab.

Course Schedule (Tuesdays and Thursdays)

July 15

Course Overview
LECTURE SLIDES (Bioinformatics)
WORKSHOP SLIDES:(Workshop1)
Workshop on Linux proficiency:
- Terminal & the bash shell
- Text editing
- First program

Assignment #1 DUE: July 22
Read (What is Bioinformatics, public via arXiv [[1]])
Required reading (Expression of Genetic Information, public via NCBI bookshelf [[2]])
Bioinformatics questions
- (5 pts) You are given a long DNA string. Describe one or two steps an algorithm should take in order to find genes on the DNA string.
- (5 pts) What is the difference between a "computer algorithm" and a "computer program"? Can a program include an algorithm?
Linux proficiency tests
- (5 pts) Install bash and Perl on your computer and determine the version of each.
  Windows: Install bash (choose one): Cygwin, Git bash
  Windows: Install Perl (choose one): Strawberry Perl, ActivePerl
  Mac: Perl should already be installed. Use the Terminal to access bash.
  Linux & others: Acceptable; however, no installation instructions will be provided.
  Print the versions of Perl and bash installed on your system with the following two commands:
  bash --version
  perl -V
  NOTE: Your output must conform to the code standards in the syllabus above.
- (5 pts) Some cases of Alzheimer's Disease have been associated with mutations in the PSEN1 gene. One study indicated that a single G to T mutation resulted in deletion of exon 9. As a consequence, amino acids 290-319 were no longer translated. Using any online database (OMIM, NCBI), answer the following questions. Include the full URL of your source.
  1. On which chromosome is the PSEN1 gene located?
  2. Which Alzheimer's Disease type(s) is associated with PSEN1?
  3. Copy the whole protein sequence and indicate where amino acids 290-319 are located. The protein sequence of PSEN1 can be accessed from NCBI here: PSEN1 http://www.ncbi.nlm.nih.gov/protein/15079861?report=fasta
  4. What is the overall length of the normal protein?
  5. What is the length of the deletion in amino acids?
  6. What percentage of the protein is lost when the amino acids are deleted?
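
The last three PSEN1 questions come down to counting residues and simple arithmetic, which can be checked on the command line. The sketch below is only an illustration: the function names are hypothetical, the toy FASTA record is made up, and the total length of 500 is a placeholder, not the real length of PSEN1. Replace it with the length you measure from the FASTA record you download.

```shell
# Count the residues in a FASTA record read from standard input:
# drop the ">" header line, remove line breaks, and count what is left.
fasta_length() {
    grep -v '^>' | tr -d '\n\r ' | wc -c | tr -d ' '
}

# Length of a deleted span, counting both end positions.
deletion_length() {
    echo $(( $2 - $1 + 1 ))
}

# Percentage of the protein lost, given deletion length and total length.
percent_lost() {
    awk -v del="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * del / total }'
}

printf '>toy\nMTELP\nAPLS\n' | fasta_length    # made-up record; prints 9

start=290; end=319    # positions given in the assignment text
total=500             # HYPOTHETICAL placeholder, not the real PSEN1 length
del=$(deletion_length "$start" "$end")
echo "deleted residues: $del"
echo "percent lost: $(percent_lost "$del" "$total")"
```

Note that the span length is end minus start plus one, because both the first and the last deleted position are counted.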

July 17

LECTURE SLIDES (Genes, Proteins, Mutations)
WORKSHOP SLIDES:(Workshop 2) with corrections July 22
Workshop on Linux proficiency:
- Managing files with bash commands
- Editing with vi
- Writing programs in Perl

July 22

LECTURE SLIDES:(Lecture 3: Gene Alignments and Homology)
READING MATERIAL:(Reading material on sequence alignments and BLAST algorithm)
WORKSHOP SLIDES:(Workshop 3)
Workshop on Linux proficiency:
- Input/Output with bash
- Perl Data
- Perl Input/Output

Assignment #2 DUE: July 29
Bioinformatics questions
Linux proficiency tests
Some diseases have animal models, which are useful for studying the disease of interest. Use NCBI's BLAST tool to help choose between two common laboratory animals for PSEN1.
1. Go to the PSEN1 gene in NCBI's GenBank: http://www.ncbi.nlm.nih.gov/protein/15079861
2. Select "Run BLAST" from the "Analyze this sequence" section.
3. In the "Choose a Search Set" section, enter "Mus musculus" in the Organism field.
4. Add another organism by clicking the + button next to the first field, and enter "Rattus rattus".
5. Go to the bottom of the page, click the BLAST button, and wait for the results.
Answer the following:
- Which animal would be ideal for a model? (1 point)
- In a sentence or two, how do the BLAST results suggest this? (2 points)
- Which hit(s) make(s) the strongest case? Describe the relevant value(s). (2 points)
Download a table of hit results by selecting the "Download" widget near the top of the browser window. Select "Hit Table (csv)" and save it. To make this file readable, change the commas into tabs by using the tr command. Research how to use the tr command to reformat the file and save the output to a new file. Finally, use cat with the -n option on the new file to number the lines, and save the numbered output as hit_results.txt.
Answer or show the following:
- Show the contents of the file hit_results.txt. (1 point)
- Show how tr and cat were used on the command line (omit output). (1 point)
- How many total hits are in hit_results.txt? (1 point)
- How many mouse hits are in hit_results.txt? (1 point)
- How many rat hits are in hit_results.txt? (1 point)
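
The general pattern of translating characters with tr and numbering lines with cat -n can be tried out on a small made-up file before working on the real download. In this sketch the file contents, column values, and the labels "mouse" and "rat" are all hypothetical. A real BLAST hit table has different columns, typically identifies hits by accession rather than by species name, and may contain blank lines, so the counting step will need adapting.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# A tiny HYPOTHETICAL comma-separated table standing in for the download.
cat > "$workdir/hits.csv" <<'EOF'
query1,mouse_hit_A,92.5,467
query1,rat_hit_B,90.1,467
query1,mouse_hit_C,75.0,300
EOF

# Translate every comma into a tab; tr reads standard input, so the
# input file is redirected in and the result is redirected to a new file.
tr ',' '\t' < "$workdir/hits.csv" > "$workdir/hits.tsv"

# Number the lines of the tab-separated file and save the numbered copy.
cat -n "$workdir/hits.tsv" > "$workdir/hit_results.txt"

# Count all lines, then only the lines containing a given label.
total_hits=$(wc -l < "$workdir/hit_results.txt" | tr -d ' ')
mouse_hits=$(grep -c 'mouse' "$workdir/hit_results.txt")
rat_hits=$(grep -c 'rat' "$workdir/hit_results.txt")
echo "total: $total_hits  mouse: $mouse_hits  rat: $rat_hits"

rm -rf "$workdir"    # remove the scratch directory
```

Because tr works only on standard input, the input file must be supplied with the < redirection rather than as an argument.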

July 23 ANNOUNCEMENT

Current participants are invited to join the Google Group for help, questions, and discussion. Send an email to biowork2014+subscribe@googlegroups.com or visit https://groups.google.com/d/forum/biowork2014 IT IS VERY IMPORTANT that you GET MEMBERSHIP in this group. We will use it for answering questions over the weekend for the midterm.

July 24

LECTURE SLIDES:(Lecture 4: Alignments, Homology, Molecular Phylogenetics)
REQUIRED READING:((Not) for the faint of the heart: Molecular Phylogenetics)
WORKSHOP SLIDES:(Workshop 4)
Workshop on Linux proficiency:
- Remote access
- Accounts and passwords
- File transfers

MIDTERM

PART1 - 25 points:(Questions on alignments, bioinformatics, databases, evolution, trees)
PART2 - 25 points:(Questions on Unix, bash, Perl)

The midterm is due no later than 11:30 AM on Tuesday, July 29, 2014. The midterm must be printed out and will be collected in class at 11:30 AM. Late midterms will not be accepted. Students should arrive early to turn in and get credit for the midterm.

July 29. Structure of human genome & genes (Part 1)

Lecture & Workshop Slides:
File:July-29-genome-gene-structure.pptx
Tree-thinking Quizzes:
File:Baum etal05 Quiz.pdf

July 31. Structure of human genome & genes (Part 2)

Cross-species comparisons
Web Exercise 1. Search for gene information using NCBI online databases

Point your browser to the NCBI Human Genome Resource page
Type in the "Find A Gene" search box "TAS2R38" and select "Homo sapiens" from the pull-down menu. Click "Go"
Select the first link, which leads to an NCBI Gene Card page. Use the Gene Card to identify the following information on TAS2R38 gene:
1. NCBI GeneID
2. Chromosome location
3. Click on "GenBank" and identify its gene structure, including the length of primary transcript, coding sequences, 5'-UTR and 3'-UTR. Does it have any introns?
4. Zoom out the Sequence View to find its neighboring genes. Zoom in to read DNA sequences.
Click the link to OMIM (under Phenotype) and find phenotypes associated with TAS2R38 gene
1. What does OMIM stand for?
2. What are the expected "taster" and "nontaster" frequencies within human populations?
3. If the ability to taste bitterness is evolutionarily advantageous, how are alleles contributing to "nontaster" maintained in the population?
4. Is the correlation between TAS2R38 gene variations and the PTC phenotype variations 100%? If not, what could be the other causes?

Web Exercise 2. Cross-species comparisons with HomoloGene

From the NCBI "TAS2R38" Gene page, click "HomoloGene" link under the "Related Information" (right-side navigation panel)
You should see a page listing TAS2R38 orthologous (i.e., same gene in different species) genes from 7 mammalian species, including human (Homo sapiens), chimpanzee (Pan troglodytes), macaque (Macaca mulatta), dog (Canis lupus familiaris), cow (Bos taurus), rat (Rattus norvegicus), and mouse (Mus musculus).
Write down your expectations for the following species relationships:
1. Is chimpanzee more closely related to macaque or to human?
2. Is dog more closely related to mouse or to cow?
3. Are rat and mouse more closely related than human and chimpanzee?
Click on the link "Show Pairwise Alignment Scores" under "Protein Alignments" and fill in the following table when the page loads. Do these sequence-comparison results change your expectations in the above? Explain.

Species pair	% Protein Sequence Identity	% DNA Seq Identity
Chimp-Human	?	?
Chimp-Macaque	?	?
Dog-Cow	?	?
Dog-Mouse	?	?
Rat-Mouse	?	?
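
For reference, a percent identity value can be understood as the share of aligned positions at which two sequences carry the same character. The sketch below shows that calculation under simplifying assumptions: the two sequences are already aligned to the same length, a position with a gap in one sequence counts as a mismatch, and no position is a gap in both. The function name and the toy sequences are hypothetical, and HomoloGene may define or round its scores differently.

```shell
# Percent identity of two pre-aligned, equal-length sequences:
# identical positions divided by alignment length, times 100.
# Prints NA when the lengths differ or the sequences are empty.
percent_identity() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a)
        if (n == 0 || n != length(b)) { print "NA"; exit }
        same = 0
        for (i = 1; i <= n; i++) {
            if (substr(a, i, 1) == substr(b, i, 1)) { same++ }
        }
        printf "%.1f\n", 100 * same / n
    }'
}

# Toy protein fragments differing at one of ten positions.
percent_identity MKTAYIAKQR MKTAHIAKQR    # prints 90.0

# Toy DNA fragments with one gap ("-") among six positions.
percent_identity ACGT-A ACGTTA            # prints 83.3
```

Comparing the protein and DNA columns of the table with this definition in mind helps explain why the two percentages for the same species pair can differ.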

You can find exact differences by clicking on "Blast" for each pairwise comparison. Lastly, obtain a phylogenetic tree of TAS2R38 protein sequences from these 7 species using the phylogeny.fr website:

Click "Show Multiple Alignment"
Click "Download" and, when the page loads, click "download" again
Go to the phylogeny.fr website and select "Phylogenetic Analysis" and then "One Click" analysis
Copy and paste your downloaded sequences into the text box and click on "Submit"
When analysis is finished, you should see a phylogenetic tree. Answer the following questions:
1. Define "orthologous genes"
2. What do tree nodes represent?
3. What do tree branches and branch length represent?
4. How do you determine species relatedness based on a phylogenetic tree?
5. Do you think this gene tree reflects species relationships? How would you improve the inference of the species tree (more genes, DNA instead of protein sequences)?
6. Do you think differences in protein sequences are associated with differences in the sense of taste among these species? How would you test this?

Assignment #3 DUE: Aug 5
Bioinformatics questions
Linux proficiency tests

Aug 5. Human genetic variations (Part 1)

Aug 7

Aug 12

Aug 14

Final Exam

Class Links
