Big Data in genetics
WEEK 1
Data, Data Science and Big Data
Data: a set of values of qualitative or quantitative variables (a variable is a measurement of
an object); data may come from all sorts of sources. Data is collected so that it can be
examined and used to support decision making, that is, to make predictions (the goal). The
goal of a data scientist is therefore to collect data from many different datasets, bring it
into a structure that can be analyzed, explore it, and use it to make predictions and
long-term inferences.
Data science: an interdisciplinary field that uses scientific methods, processes, algorithms
and systems to extract knowledge and insights from structured and unstructured data. The
overall goal is to go from unstructured data, to raw data, to information.
Data scientist: a scientist who works on data collection and engineering, turning
unstructured data into a structured format, and then performs all sorts of analyses on that data.
Big data: a very large set of data (volume), that is produced by people using the internet
and that can only be stored, understood, and used with the help of special tools and
methods.
Volume relates to the scale of the data: big data generally means very large datasets.
Variety relates to the fact that the data often comes from all sorts of sources, so many
different types of data have to be integrated.
Velocity refers to the speed at which the data is acquired; usually we are talking about
data that comes in continuously.
Veracity refers to the uncertainty of the data, meaning that there might be errors, noise,
or missing data.
Value refers to the fact that in the end we want not only to analyze the data but also to
add value to all these different types of resources: the information extracted from the data
is the additional value.
Why now?
Data collection is growing in scale and becoming cheaper, so more and more data is
available, and the availability of computing power is also growing fast.
WEEK 2
Data Report
Reporting data: the classic format is a scientific paper. It starts with a question/goal and
some introduction and background information, followed by the study design (the matrices,
the origin of the data, how the data will be examined, the statistical analysis), then an
exploratory analysis such as a correlation analysis, and finally the results, reporting the
prediction model and its accuracy. It ends with a discussion and conclusions.
Data report: most research is not done in academia, and outside academia people are less
used to scientific papers, so the data report emerged as the alternative. A data report is an
evaluation tool to assess past, present and future business information while keeping track
of the overall performance of a company. It combines various business data and is usually
used at both the operational and the strategic level of decision making. In other words, it is
an evaluation tool or description that integrates past results (our current data) with our
goal, which is to make predictions of future events from this data.
To write a data report there are several key points to follow: define the type of your report
and the field in which it will be used; define the audience; have a plan, depending on the
message you want to convey to your audience; be objective; be visually stunning, so use good
figures; tell a story, a narrative; keep it simple, much shorter than a scientific paper; and
make sure your report is actionable, meaning that it should have actions related to it, so
that the audience can act on the data.
Heritability and genome-wide association studies
Field of human genetics: rapid changes have occurred in the past decades, such as new
technologies, novel methods, large-scale collaborations and novel disease insights. There are
still some open issues in this field, such as the relative influence of genes (G) and
environment (E), the nature of G (additive or non-additive), and determining causal
mechanisms, especially for polygenic traits.
Twin studies: MZ twins share 100% of their genes, 100% of their shared environment, and
0% of their non-shared environment, while DZ twins share on average 50% of their
segregating genes, 100% of their shared environment, and 0% of their non-shared
environment. Knowing this helps in estimating the heritability of a trait. If a trait is 100%
heritable and additive, the similarity (correlation r) of MZ pairs is twice that of DZ pairs,
a ratio of 2 to 1. If instead only the shared environment is causing the differences between
individuals, the MZ and DZ correlation coefficients will be the same. Since many heritability
studies have been done and they all report different percentages, a meta-analysis was
carried out to combine the data from all twin studies and obtain point estimates that can
be used globally. It covered 3000 studies from 1968 to 2012. It found that, across traits,
heritability is about 50% on average. The main conclusions are that all traits are heritable
to some extent, that the influence of the shared environment was overall relatively small,
and that the majority of traits are consistent with a model where all genetic variance is
additive.
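The MZ/DZ comparison above can be turned into rough numbers with Falconer's formulas. These notes do not name that method, so treat it as an assumption about what is intended: h2 = 2(r_MZ - r_DZ) for additive genetics, c2 = 2 r_DZ - r_MZ for shared environment, and e2 = 1 - r_MZ for non-shared environment. A minimal Python sketch, with a hypothetical function name and made-up correlations:

```python
def falconer_estimates(r_mz, r_dz):
    """Rough variance shares from twin correlations (Falconer's formulas)."""
    h2 = 2 * (r_mz - r_dz)  # additive genetic share (heritability)
    c2 = 2 * r_dz - r_mz    # shared environment share
    e2 = 1 - r_mz           # non-shared environment share (incl. measurement error)
    return h2, c2, e2

# Fully heritable, additive trait: r_MZ is twice r_DZ
print(falconer_estimates(1.0, 0.5))  # (1.0, 0.0, 0.0)

# Only shared environment matters: r_MZ equals r_DZ, so h2 comes out as 0
print(falconer_estimates(0.6, 0.6))
```

These are quick estimates only; they can fall outside the 0 to 1 range when the additive model does not fit the data.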
Heritability: the proportion of trait variance attributable to genetic variance, that is, the
extent to which observed individual differences can be traced back to genetic differences. It
causes genetically related individuals to correlate on a trait, and suggests that variations
in genes underlie trait differences between individuals. The right interpretation is that if a
trait is 80% heritable, 80% of the differences in phenotype between individuals in the
population are explained by genetic differences; it does not mean that 80% of one person's
trait value is caused by genes.
DNA facts:
The 4 nucleotides are universal, and a given sequence of 3 nucleotides (a codon) codes for
the same amino acid in every living being, but is the total sequence of 3 × 10^9 nucleotides
the same for all individuals?
99.9% of the DNA sequence is the same between individuals; the remaining 0.1% amounts to
a number of sites on the order of 10^6 (0.1% of 3 × 10^9 is about 3 × 10^6) that differ
between individuals, resulting in phenotypic differences.
SNPs (the 0.1%): genetic variations that can occur within a gene (in a protein-coding part,
in a regulatory region, in an exon or in an intron) or outside genes (in a regulatory segment
or in a region of currently unknown function). These genetic variations can be harmless,
harmful, latent, or silent. They are due to mutation (at the base-pair level), recombination
(at the level of parts of chromosomes) or segregation (at the level of combinations of
chromosomes).
Monogenic disorders are influenced by one gene, so by one mutation, and follow Mendelian
inheritance; most of their genetic causes are already known and the effects are very large.
Polygenic disorders are influenced by multiple genes and their genetic causes are mostly
unknown; they are referred to as complex disorders because multiple interactions between
genetic and environmental factors are involved.
Importance (picture).
How do we find these genes?
Candidate gene studies were carried out in the 1980s and 1990s: several genes were
pre-selected based on prior knowledge and on convenience, and then tested for association
with the traits. Genome-wide association studies (GWAS) have been carried out since 2006.
GWAS strategy: genotype a large set of individuals on about 1 million SNPs; because
neighbouring SNPs are correlated, there is no need to genotype all 3 × 10^9 base pairs. For
each SNP, compare the allele frequency between cases and controls and conduct a
statistical test for a difference in frequency. This is done with microarrays, which can
contain more than 1 million tagging SNPs covering the genome at high density, a revolution
that started in 2006. The result of a GWAS is usually shown as a Manhattan plot, in which
each point represents one tested SNP, placed by its position in the genome and by its p
value (usually as -log10 p); the SNPs whose allele frequencies differ significantly stand
out and mark the regions that might contribute to the phenotype differences.
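The per-SNP test described above can be sketched as a chi-square test on a 2x2 table of allele counts (cases vs controls, alternative vs reference allele). The notes do not say which test is used, so the chi-square test, the function name and the counts below are illustrative assumptions. Only the Python standard library is needed, because for 1 degree of freedom the p value equals erfc(sqrt(chi2 / 2)).

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 df) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were equal in both groups
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical SNP: 1000 cases and 1000 controls, so 2000 alleles per group
chi2, p = allelic_chi_square(case_alt=600, case_ref=1400, ctrl_alt=500, ctrl_ref=1500)
print(f"case freq={600 / 2000:.2f}, control freq={500 / 2000:.2f}")
print(f"chi2={chi2:.2f}, p={p:.2e}, -log10(p)={-math.log10(p):.2f}")
```

In a real GWAS this is repeated for every SNP, and the -log10(p) values are what the Manhattan plot displays. The sketch does not handle a SNP where one allele is absent from both groups (an expected count of zero).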
Advantages: GWAS may identify several possible loci as it spans the whole genome;
relationships between loci may point to new biological pathways; and results from multiple
studies can be integrated, aiding the prioritization of genes for replication and increasing
statistical confidence.
Disadvantages: there is an increased likelihood of false positives; there is a risk of
population stratification; a large number of samples is needed; and vast amounts of data
need to be produced and analyzed.
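The false-positive problem follows from the number of tests: with about 1 million SNPs, a per-test threshold of 0.05 would flag tens of thousands of SNPs by chance alone. A common remedy, not spelled out in these notes and therefore an assumption here, is a Bonferroni correction, which gives the conventional genome-wide significance threshold of 5 × 10^-8. A minimal Python sketch (all names are illustrative):

```python
import math

N_TESTS = 1_000_000  # assumed number of SNPs tested
ALPHA = 0.05         # assumed acceptable study-wide false positive rate

# Expected number of chance hits if every SNP were tested at ALPHA
expected_false_positives = N_TESTS * ALPHA

# Bonferroni: divide the study-wide rate by the number of tests
threshold = ALPHA / N_TESTS

def is_genome_wide_significant(p_value, cutoff=threshold):
    """True if a SNP's p value passes the corrected threshold."""
    return p_value < cutoff

print(f"expected chance hits without correction: {expected_false_positives:.0f}")
print(f"corrected threshold: {threshold:.1e}")
print(f"line height on a Manhattan plot: {-math.log10(threshold):.2f}")
print(is_genome_wide_significant(1e-9), is_genome_wide_significant(1e-5))
```

Because neighbouring SNPs are correlated the tests are not fully independent, so this correction is somewhat conservative.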
Power to detect genes of very small effect size: large sample sizes are needed, because
only they ultimately allow the detection of genetic variants with very small effects.
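To see why sample size matters so much, here is a rough power calculation for a single SNP and a quantitative trait. It is a sketch under stated assumptions rather than the method used in the lecture: a two-sided 1-df test, a normal approximation, a significance threshold of 5 × 10^-8, and a SNP that explains a fraction `variance_explained` of the trait variance. The function name and the example values are hypothetical.

```python
from statistics import NormalDist

def gwas_power(n, variance_explained, alpha=5e-8):
    """Approximate power of a two-sided 1-df association test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected z-score of the SNP at this sample size
    # (square root of the non-centrality parameter)
    shift = (n * variance_explained / (1 - variance_explained)) ** 0.5
    # Probability that the observed z-score lands beyond either critical value
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

# A SNP explaining 0.1% of the trait variance at three sample sizes
for n in (1_000, 10_000, 100_000):
    print(n, f"{gwas_power(n, 0.001):.4f}")
```

Under these assumptions the same SNP is nearly undetectable with a thousand individuals and almost certain to be detected with a hundred thousand.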
Schizophrenia: 131 significant hits together explain less than 3% of the liability to
schizophrenia, whereas the risk associated with 8300 genetic variants combined (a polygenic
risk score) explains 32% of the liability. This means that many genetic variants are still
undetected, probably because their effects are even smaller.
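A polygenic risk score is usually computed as a weighted sum: for each variant, the number of risk alleles a person carries (0, 1 or 2) is multiplied by that variant's effect size from a GWAS, and the products are added up. The notes do not give the formula, so this is the common textbook form. The four variants, their effect sizes and the two genotypes below are made up for illustration.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Weighted sum of risk-allele counts (0, 1 or 2 per variant)."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("need exactly one effect size per variant")
    return sum(count * beta for count, beta in zip(allele_counts, effect_sizes))

# Hypothetical per-variant effect sizes (e.g. log odds ratios from a GWAS)
betas = [0.02, 0.05, -0.01, 0.03]

person_a = [0, 1, 2, 1]  # risk-allele counts at the four variants
person_b = [2, 2, 0, 2]

print(f"person A: {polygenic_risk_score(person_a, betas):.2f}")
print(f"person B: {polygenic_risk_score(person_b, betas):.2f}")
```

Each variant contributes very little on its own; the score becomes informative only because thousands of small effects are combined.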
The huge sample sizes that are needed can now be achieved, so GWAS results are reliable
and many genes have been discovered. However, some of the hits lie in gene deserts,
meaning that they give no functional information. What we do know is that the majority of
human complex traits are probably influenced by thousands of genes with very small effects.