Genome Notes

by moodyharsh

Biological Organisms can be classified as:

Eukaryotes -> Cell Has Nucleus
Prokaryotes -> Cell Does Not

Cells -> Tissue -> Organ -> Organ System -> Organism

Cells contain Macro Molecules (Cell Wall, DNA, Ribosome) and Micro Molecules (Water, Salt)

DNA Holds Protein manufacturing Information.

Much like 0 and 1, we have A,T,G,C Nucleotides.
Much like a byte, we have a codon - A Nucleotide triple.
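
To make the byte analogy concrete, here is a minimal sketch that chops a nucleotide string into codons. The function name `split_codons` is made up for illustration, and it simply drops any leftover letters at the end that do not form a full triple.

```python
def split_codons(sequence: str) -> list[str]:
    """Split a nucleotide string into consecutive 3-letter codons."""
    sequence = sequence.upper()
    invalid = set(sequence) - set("ATGC")
    if invalid:
        raise ValueError(f"unexpected nucleotides: {sorted(invalid)}")
    # Ignore a trailing partial codon (1 or 2 leftover letters).
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


# 4 possible letters in each of 3 positions gives 4**3 = 64 possible codons.
CODON_COUNT = 4 ** 3
```

So where a byte has 256 possible values, a codon has 64.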

Gene == Some set of codons which play a role in the body.
All genes == genome or genotype

Sexual Organisms have 2 genomes, one from each parent.

DNA/2 == RNA (DNA is double-stranded, RNA is a single strand that uses U in place of T)
During Protein creation and Cell replication, various RNAs like mRNA, tRNA, rRNA become important.
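
As a rough sketch of the DNA to RNA step: if we take the coding strand of the DNA, the matching mRNA reads the same except that every T becomes a U. The function name `transcribe` is just an illustration, and the real cellular machinery (template strand, splicing and so on) is ignored here.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA letters for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")
```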

mtDNA is DNA of mitochondria, another independent component of the cell.

Some genes are collected and stored independently as a chromosome.
Chromosome looks like an X

A gene is mapped relative to the X’s center (the centromere), at a position called its locus.

Thus, Genome = set of chromosomes

All cells have the same genome.
But cells also have restricting macromolecules, which make each cell of a tissue unique.

For Humans,
Each cell has two sets of chromosomes
23 chromosomes from each parent (Total 46)
Reproductive cells have only 23
Approx 30k genes
About 99% of the genome does not code for proteins (loosely, "useless")
0.9% are common to all and useful
0.1% are useful and unique

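An allele is one variant of a gene's code, as extracted experimentally from an individual.

A human has 2 alleles of each gene (one per parent). This results in zygosity for each gene: the two alleles either match (homozygous) or differ (heterozygous) - competition, fitness for survival.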
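
In code terms, zygosity is just a comparison of the two alleles at one locus. A tiny sketch, where the function name and the single-letter alleles are made up for illustration:

```python
def zygosity(allele_from_mother: str, allele_from_father: str) -> str:
    """Classify a pair of alleles at one locus."""
    if allele_from_mother == allele_from_father:
        return "homozygous"
    return "heterozygous"
```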

For the useful 1% genes, we have mutations.
SNPs (Single Nucleotide Polymorphisms) are single-letter mutations observed in a significant part of a population.
Some mutations cause disease, some make Spider-Man.
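
A minimal sketch of spotting single-letter differences between a reference sequence and a sample. It assumes the two strings are already aligned and of equal length, and `find_snps` is a made-up name. Strictly speaking, a difference found this way is only a candidate: it counts as an SNP once it is seen often enough across a population.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (position, reference_letter, sample_letter) for every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref, alt)
        for position, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]
```

Positions here are 0-based string offsets, not genomic coordinates.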

Haplotype is a set of SNPs that are statistically associated (they tend to be inherited together).

GenBank -> Whole Genome with redundancy

dbSNP -> Just Mutations with locus information

A lot of submitted SNPs are redundant.

RefSNP -> A non-redundant version of dbSNP. More useful for research.
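
The idea of collapsing redundant submissions can be sketched as grouping them by locus. This is only an illustration of the concept: the tuple layout `(submission_id, chromosome, position)` and the ids are made up, and the real clustering done for RefSNP is more involved.

```python
from collections import defaultdict


def cluster_submissions(
    submissions: list[tuple[str, str, int]],
) -> dict[tuple[str, int], list[str]]:
    """Group submitted SNP ids that point at the same (chromosome, position)."""
    clusters: dict[tuple[str, int], list[str]] = defaultdict(list)
    for submission_id, chromosome, position in submissions:
        clusters[(chromosome, position)].append(submission_id)
    return dict(clusters)
```

Each key in the result would then correspond to one non-redundant entry.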

Formats of SNP available online

This format is commonly used to store genetic sequences

SNP Type

sql.gz and bcp.gz
MSSQL DDL dump and the Data Dump in bcp (tab separated values)
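
Since the bcp dump is just gzip-compressed tab separated text, it can be read with the standard library. A minimal sketch, assuming UTF-8 text and no quoting; the function name is made up, and what each column means has to come from the matching sql.gz DDL.

```python
import csv
import gzip
import io


def read_bcp_rows(raw_gz: bytes) -> list[list[str]]:
    """Decompress a gzip'd tab-separated dump and return its rows."""
    with gzip.open(io.BytesIO(raw_gz), mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data.
        return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
```

For a real file on disk, read its bytes first or pass the path straight to `gzip.open`; for large dumps it would be better to iterate over rows than to build one big list.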

SQL like DDL, described much like Protocol Buffers or C Structs.
Multiple backends to actually store the data – binary, xml, text.
Primarily used as a research data exchange format. Kind of like JSON for labs.

This DDL schema basically catalogues SNP data.

How is dbSNP stored?

A “HANDLE” is given, via a form, to anyone who wants to upload.

The data is uploaded into the database.

After a certain period, a build occurs. This generates the following important files:

ASN.1 flat file for every organism
XML dump
VCF files
ss_fasta and rs_fasta sequences of the SNPs

And organism specific database dumps,

Data common to all the Submitted SNP’s is in

When choosing sequences to submit an SNP, points to take into consideration may include:

  • Is the SNP a “double hit” SNP?
  • Is there Minor Allele Frequency (MAF) data available for a SNP?
  • Has this SNP been identified in the population, e.g., ethnic group, that you are examining?

Such biological qualifiers give confidence that a given SNP is well studied and may be useful as a marker in your particular study.
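
For reference, Minor Allele Frequency can be sketched as the share of the second most common allele among all alleles observed at one locus. This assumes genotypes are given as two-letter strings such as "AG"; the function name and the input format are made up for illustration.

```python
from collections import Counter


def minor_allele_frequency(genotypes: list[str]) -> float:
    """Frequency of the second most common allele across all genotypes."""
    counts = Counter("".join(genotypes))
    if len(counts) < 2:
        # Zero or one distinct allele observed: there is no minor allele.
        return 0.0
    total = sum(counts.values())
    second_most_common_count = counts.most_common()[1][1]
    return second_most_common_count / total
```

For example, the genotypes AA, AG, GG, AA contain 5 A's and 3 G's, so the MAF is 3/8.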

Input -> An allele, a candidate SNP

The input is run through BLAST, and we need to list close matches with the following data:

Gene and position
Design strand
Reference ID and database obtained from (with build number)
If validated and / or genotyped publicly
Other SNPs nearby and associated SNPs
Disease association
Paper references
Minor Allele Frequency (MAF) and Allele Type
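
One way to hold the fields listed above for each match is a small record type. This is only a sketch: every field name below is made up to mirror the list, and nothing here is an NCBI schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SnpMatch:
    """One close match for a candidate SNP (hypothetical field names)."""
    gene: str
    position: int
    design_strand: str            # e.g. "+" or "-"
    reference_id: str             # id of the matched SNP
    source_database: str          # database the id was obtained from
    build: int                    # build number of that database
    validated: bool = False
    publicly_genotyped: bool = False
    nearby_snps: list[str] = field(default_factory=list)
    disease_associations: list[str] = field(default_factory=list)
    paper_references: list[str] = field(default_factory=list)
    minor_allele_frequency: Optional[float] = None  # None when no MAF data exists
    allele_type: str = ""

    def is_well_studied(self) -> bool:
        """Rough confidence check: validated and has MAF data."""
        return self.validated and self.minor_allele_frequency is not None
```

The `is_well_studied` check is just one possible way to use the qualifiers; which fields should count depends on the study.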

NCBI documentation assumes you know the whys and whats -