Genomics – bioinformatics DT-2
Lecture 1 - BLAST
Recap: -omics (omics: sequencing all of something) (meta: looking at all organisms instead of one)
- genomics: sequence all of the DNA of one organism
- transcriptomics: sequence all of the mRNA in an organism/tissue/cell
- proteomics: sequence all of the proteins in an organism/tissue/cell
- metagenomics: sequence the DNA of all organisms in a sample
- metatranscriptomics: sequence the mRNA of all organisms in a sample
- metaproteomics: sequence the proteins of all organisms in a sample
How metagenomics works:
- you take a sample (e.g. coral, seawater, a piece of gut, etc.)
- you filter it to get rid of the things you do not want to look at
- then only the micro-organisms remain
The biology behind the omics revolution
- omics solves a major problem in science: biases
- people are mostly interested in: their diseases, their food, themselves
- this causes biases in our general understanding of biology, and biases in our databases. For example: most studied bacteria are associated with humans
Data and the bioinformatician
- bioinformaticians use data in two different ways:
- 1: question first / top down: given a biological question, a good bioinformatician will immediately think about which datasets could be used to answer it
- 2: data first / bottom up: given a dataset, a good bioinformatician will immediately think about which biological hypothesis it could help to test
Bioinformatics
- the study of informatic processes in biotic systems
Bioinformatic data analysis
- using computational methods to analyze biological data
What to do with tons of different sequences?
- searching a database: we want to find the sequences in the database that match a query sequence
- but why? → if two sequences are similar, we assume that they are related (have a common ancestor)
- example: given a database of 300k genomes, we want to find the best hit for a given query
- we have to break the search down into smaller pieces, because mutations make exact matches over the full length unlikely
k-mer searches
- sequences can be divided into shorter subsequences or k-mers (k-mers consist of k nucleotides or
amino acids)
- we can make an index of all k-mers that occur in the database sequences
- if we split a query into k-mers of the same length, we can rapidly identify all the database
sequences containing them
- but: we limit ourselves to exact matches
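A minimal sketch of the k-mer index described above, in Python. The toy database, the sequence names and k = 4 are assumptions chosen for illustration only; real tools use far more compact index structures.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(database, k):
    """Map each k-mer to the set of database sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def lookup(index, query, k):
    """Count, per database sequence, how many distinct query k-mers it contains."""
    hits = defaultdict(int)
    for kmer in set(kmers(query, k)):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return dict(hits)

# Hypothetical toy database (names and sequences are made up)
database = {
    "seqA": "ACGTACGTGA",
    "seqB": "TTTTGGGGCC",
}
index = build_index(database, k=4)

exact_hits = lookup(index, "ACGTGA", k=4)    # all three query k-mers occur in seqA
mutated_hits = lookup(index, "ACGAGA", k=4)  # one substitution: no k-mer matches any more
```

The second lookup illustrates the limitation from the last bullet: a single substitution in this short query changes every k-mer, so the exact-match index returns nothing.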
natural sequence divergence
- if we align metagenomic sequencing reads to a reference genome, we can distinguish multiple distinct SAR86 strains
- the sequences at the top (~97% identity) belong to a strain that is closely related to the reference genome
- the sequences below (~60–80% identity) belong to more distantly related strains
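To make the identity percentages concrete, here is a small Python sketch that computes percent identity for two already aligned sequences. It assumes equal-length strings with '-' for gaps and divides by the number of alignment columns, which is only one of several conventions in use; the example sequences are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences carry the same letter.

    Assumes both inputs are already aligned (equal length, '-' marks a gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)

# Hypothetical read vs. reference: one substitution in ten columns
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")

# A gap column counts as a non-identical column under this convention
identity_with_gap = percent_identity("ACG-T", "ACGTT")
```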
pairwise sequence alignments
- given two sequences: seqX = X1X2…XM and seqY = Y1Y2…YN
an alignment is an assignment of gaps to positions 0, …, M in seqX, and to positions 0, …, N in seqY, so as to line up each letter in one sequence with either a letter or a gap in the other sequence
- you place the sequences one above the other so that there are as many matches as possible
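One standard way to compute such an alignment is the Needleman-Wunsch dynamic programming algorithm for global alignment. The notes do not name an algorithm, so this is a swapped-in textbook technique, sketched in Python; the scores (match = 1, mismatch = -1, gap = -1) are illustrative assumptions.

```python
def global_align(x, y, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    # score[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # letter vs letter
                              score[i - 1][j] + gap,      # gap in y
                              score[i][j - 1] + gap)      # gap in x
    # Trace back from the bottom-right cell to recover one optimal alignment
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[m][n], "".join(reversed(ax)), "".join(reversed(ay))

# Made-up example: the second sequence lacks one T
best_score, aligned_x, aligned_y = global_align("GATTACA", "GATACA")
print(aligned_x)
print(aligned_y)
```

With these scores the best alignment has six matches and one gap. Several equally good placements of the gap exist, and the traceback returns one of them.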
searching a database
- could we make sequence alignments between the query and every database sequence? → theoretically yes, but it takes a long time
best of both worlds
- using a k-mer search (= index search) will be very fast… but limits you to exact matches
- making all possible pairwise alignments will let you find distantly related sequences as well… but it would take a very long time
- the solution is to combine the best of both worlds: quickly find potential hits using exact k-mers stored in an index, and make pairwise alignments, but only for those potential hits
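The combination can be sketched in Python as a two-stage search: the k-mer index selects candidate sequences, and only those candidates are aligned. The alignment stage here is a score-only Smith-Waterman local alignment, chosen as an assumption for illustration; the toy database, k = 4 and the scoring values are also made up.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return the set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(database, k):
    """Map each k-mer to the set of database sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def local_score(x, y, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score (score only, no traceback)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def search(query, database, index, k):
    """Stage 1: collect candidates sharing a k-mer. Stage 2: align only those."""
    candidates = set()
    for kmer in kmers(query, k):
        candidates |= index.get(kmer, set())
    scores = {name: local_score(query, database[name]) for name in candidates}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy database: seqC differs from seqA by one substitution
database = {
    "seqA": "ACGTACGTGA",
    "seqB": "TTTTGGGGCC",
    "seqC": "ACGTTCGTGA",
}
index = build_index(database, k=4)
results = search("ACGTACGTGA", database, index, k=4)
```

seqB shares no k-mer with the query, so it is never aligned. seqC is still found despite its mismatch, because k-mers on either side of the substitution match exactly.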
basic local alignment search tool (BLAST)
- BLAST finds similar sequences at reasonable speed (10-50x faster than previous algorithms)
- terminology: query: the sequence we search the database with; hit or subject: a similar sequence found in the database
- BLAST is the most used bioinformatics program → more than 100,000 queries per day on the NCBI BLAST server
- even faster algorithms are now available
the BLAST search algorithm
- 1: identifies all words (length W) in the query (default lengths: W = 3 for protein, W = 11 for DNA; word similarity is based on substitution scores)
- 2: quickly finds similar words in the database (similar words are defined using the substitution matrix); the index quickly locates all potential hit sequences
- 3: extends the seeds in both directions to find HSPs between query and hit (HSP, high-scoring segment pair: a region that can be aligned with a score above a certain threshold)
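A simplified Python sketch of steps 1 and 3 for DNA, not BLAST's actual implementation. It assumes exact word matches as seeds, so the substitution-matrix neighbourhood of step 2 is left out, and it extends without gaps using a simple drop-off rule. W = 4, the match/mismatch scores, x_drop and min_score are all illustrative assumptions.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    positions = {}
    for j in range(len(subject) - w + 1):
        positions.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in positions.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds

def extend_seed(query, subject, qi, sj, w, match=1, mismatch=-2, x_drop=3):
    """Ungapped extension of a seed in both directions.

    In each direction, extension stops at the sequence end or once the running
    score has fallen more than x_drop below the best score seen so far.
    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    def extend(step):
        # step = +1 extends to the right of the seed, -1 to the left
        q = qi + w if step == 1 else qi - 1
        s = sj + w if step == 1 else sj - 1
        score = best = 0
        length = best_len = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = extend(+1)
    left_score, left_len = extend(-1)
    score = w * match + left_score + right_score
    return score, qi - left_len, qi + w + right_len, sj - left_len

def find_hsps(query, subject, w=4, min_score=6):
    """Extend every seed and keep the distinct segments scoring at least min_score."""
    found = set()
    for qi, sj in find_seeds(query, subject, w):
        hsp = extend_seed(query, subject, qi, sj, w)
        if hsp[0] >= min_score:
            found.add(hsp)
    return sorted(found, reverse=True)

# Made-up example: the subject contains the query with one substitution
query = "ACGTTGCATCGA"
subject = "GGACGTTGAATCGAGG"
hsps = find_hsps(query, subject)
```

In this example five different seeds lie on the same diagonal and all extend to the same segment, so the set keeps only one HSP.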