Biological Organisms can be classified as:
Eukaryotes -> Cell has a nucleus
Prokaryotes -> Cell has no nucleus
Cells -> Tissue -> Organ -> Organ System -> Organism
Cells contain macromolecules (DNA, RNA, proteins) and small molecules (water, salts). Structures like the cell wall and ribosomes are assembled from macromolecules.
DNA holds the information for manufacturing proteins.
Much like 0 and 1, we have A,T,G,C Nucleotides.
Much like a byte, we have a codon: a nucleotide triplet.
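To make the byte analogy concrete, here is a minimal sketch that chops a nucleotide string into codons. The function name split_codons and the sample sequence are made up for illustration.

```python
def split_codons(seq: str) -> list[str]:
    """Split a nucleotide string into consecutive triplets (codons)."""
    seq = seq.upper()
    # Ignore a trailing 1-2 bases that do not form a full codon.
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


print(split_codons("ATGGCCTAA"))  # ['ATG', 'GCC', 'TAA']
# 4 possible letters in 3 positions gives 4**3 = 64 possible codons.
print(4 ** 3)  # 64
```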
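Gene == a stretch of DNA (a set of codons) that plays a role in the body, typically by coding for a protein.
All of an organism's DNA, genes included == genome. The particular alleles an individual carries == genotype.
Sexually reproducing organisms have 2 copies of the genome, one from each parent.
RNA == a single-stranded copy of a DNA segment, with U in place of T.
During protein creation and cell replication, various RNAs such as mRNA, tRNA and rRNA become important.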
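A rough sketch of the DNA to mRNA step: the mRNA reads like the coding strand of the gene with every T swapped for U. The function name transcribe is my own, and this ignores everything the cell really does around transcription (splicing, etc.).

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```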
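mtDNA is the DNA of the mitochondria, a cell component that carries its own small genome.
Groups of genes are packaged and stored together as a chromosome.
A chromosome looks like an X (after it has been copied, just before the cell divides).
A gene's position on a chromosome is called its locus, described relative to the X’s center (the centromere).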
Thus, Genome = set of chromosomes
Nearly all cells of an organism have the same genome.
But cells also have regulatory macromolecules that restrict which genes are active, which makes the cells of each tissue distinct.
Each human cell has two sets of chromosomes
23 chromosomes from each parent (Total 46)
Reproductive cells have only 23
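Approx 20k protein-coding genes (older estimates said 30k)
About 99% of the genome does not code for proteins (often called useless, though much of it is regulatory)
About 0.9% is useful and common to everyone
About 0.1% is useful and unique to the individual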
A variant form of a gene is called an allele.
A human has 2 alleles of each gene, one per parent. This results in zygosity: the two alleles are either the same (homozygous) or different (heterozygous), which affects competition and fitness for survival.
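Zygosity as a tiny sketch: compare the two alleles a person carries at one locus. The function name and the single-letter alleles are just for illustration.

```python
def zygosity(allele_from_mother: str, allele_from_father: str) -> str:
    """Classify a pair of alleles at one locus."""
    if allele_from_mother == allele_from_father:
        return "homozygous"
    return "heterozygous"


print(zygosity("A", "A"))  # homozygous
print(zygosity("A", "G"))  # heterozygous
```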
Mutations in the useful 1% are the ones that matter most.
SNPs (single nucleotide polymorphisms) are single-base variations observed in populations.
Some mutations cause disease, some make Spider-Man.
A haplotype is a set of SNPs that tend to be inherited together, with statistical significance.
GenBank -> Whole genome sequences, with redundancy
dbSNP -> Just mutations, with locus information
A lot of submitted SNPs are redundant.
RefSNP -> A non-redundant version of dbSNP. More useful for research.
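The idea of collapsing redundant submissions into one reference entry can be sketched like this. It is a simplification, not how NCBI actually builds RefSNP: I assume two submissions are "the same SNP" when they share chromosome and position, and the ss-style IDs and positions below are invented.

```python
from collections import defaultdict


def cluster_submissions(submissions):
    """Group submission IDs by (chromosome, position).

    `submissions` is an iterable of (submission_id, chromosome, position).
    Each resulting group stands in for one non-redundant reference entry.
    """
    clusters = defaultdict(list)
    for sub_id, chrom, pos in submissions:
        clusters[(chrom, pos)].append(sub_id)
    return dict(clusters)


# Hypothetical submissions: two labs report the same position on chromosome 1.
subs = [("ss1", "1", 1000), ("ss2", "1", 1000), ("ss3", "2", 500)]
print(cluster_submissions(subs))
# {('1', 1000): ['ss1', 'ss2'], ('2', 500): ['ss3']}
```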
Formats of SNP data available online:
FASTA
This format is commonly used to store genetic sequences.
sql.gz and bcp.gz
MSSQL DDL dump (sql.gz) and the data dump in bcp (tab-separated values).
ASN.1
SQL-like DDL, with types described much like Protocol Buffers or C structs.
Multiple backends to actually store the data: binary, XML, text.
Primarily used as a research data exchange format. Kind of like JSON for labs.
This DDL schema basically catalogues SNP data.
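Since the bcp.gz dump is just gzip-compressed tab-separated text, reading it could look roughly like the sketch below. The function name read_bcp and the two-column sample rows are invented; the real column layout comes from the matching sql.gz DDL file.

```python
import csv
import gzip
import io


def read_bcp(raw_gz: bytes):
    """Yield each row of a gzip-compressed, tab-separated dump as a list of strings."""
    with gzip.open(io.BytesIO(raw_gz), mode="rt", newline="") as handle:
        # QUOTE_NONE: treat quote characters as plain data, not as field quoting.
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


# Made-up stand-in for a downloaded .bcp.gz file.
sample = gzip.compress(b"1\tA/G\n2\tC/T\n")
print(list(read_bcp(sample)))  # [['1', 'A/G'], ['2', 'C/T']]
```

For a real file on disk, passing the path straight to gzip.open instead of wrapping bytes in io.BytesIO should work the same way.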
How is dbSNP stored?
A “HANDLE” (submitter ID) is given to anyone who wants to upload.
The data is uploaded into the database.
After a certain period, a build occurs. This generates the following important files:
ASN.1 flat file for every organism ftp://ftp.ncbi.nih.gov/snp/organisms/arabidopsis_3702/ASN1_flat/
XML dump ftp://ftp.ncbi.nih.gov/snp/organisms/arabidopsis_3702/XML/
VCF files ftp://ftp.ncbi.nih.gov/snp/organisms/arabidopsis_3702/VCF/
ss_fasta and rs_fasta sequences of the SNPs: ftp://ftp.ncbi.nih.gov/snp/organisms/arabidopsis_3702/ss_fast ftp://ftp.ncbi.nih.gov/snp/organisms/arabidopsis_3702/rs_fasta
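Two of the build outputs above, VCF and FASTA, are plain text and easy to poke at. A minimal sketch of reading each, assuming the standard layouts (VCF: tab-separated CHROM, POS, ID, REF, ALT, ... with "#" header lines; FASTA: a ">" header line followed by sequence lines). The function names and all sample values are invented, and real dbSNP files carry more fields than this looks at.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    id: str
    ref: str
    alt: list  # one entry per alternate allele


def parse_vcf_line(line: str) -> Optional[VcfRecord]:
    """Parse one VCF data line; return None for header or blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return VcfRecord(chrom, int(pos), var_id, ref, alt.split(","))


def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Made-up examples, not real dbSNP records.
print(parse_vcf_line("1\t12345\trs0000\tA\tG,T\t.\t.\t."))
print(parse_fasta(">snp_a\nACGT\nTTGA\n>snp_b\nGGCC\n"))
```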
And organism specific database dumps,
Data common to all the Submitted SNP’s is in
When choosing sequences to submit an SNP, points to take into consideration may include:
- Is the SNP a “double hit” SNP?
- Is there Minor Allele Frequency (MAF) data available for a SNP?
- Has this SNP been identified in the population, e.g., ethnic group, that you are examining?
Such biological qualifiers give confidence that a given SNP is well studied and may be useful as a marker in your particular study.
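For reference, MAF is the frequency of the second most common allele seen at a position in a population. A small sketch of computing it from a list of observed alleles; the function name and the sample counts are invented.

```python
from collections import Counter


def minor_allele_frequency(observed_alleles: list) -> float:
    """Frequency of the second most common allele among the observations."""
    counts = Counter(observed_alleles)
    if len(counts) < 2:
        return 0.0  # zero or one distinct allele seen, so no minor allele
    _, minor_count = counts.most_common(2)[1]
    return minor_count / len(observed_alleles)


# Hypothetical sample: 8 chromosomes carry A, 2 carry G.
print(minor_allele_frequency(["A"] * 8 + ["G"] * 2))  # 0.2
```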
Input -> An allele, a candidate SNP
The input is run through BLAST, and we need to list close matches with the following data:
Gene and position
Reference ID and the database it was obtained from (with build number)
Whether it is validated and/or publicly genotyped
Other SNPs nearby and associated SNPs
Minor Allele Frequency (MAF) and Allele Type
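One way to hold the per-match data listed above is a small record type. This is only a sketch of the shape of the output: the class name SnpMatch, every field name, and the sample values are my own, not anything defined by NCBI or BLAST.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SnpMatch:
    gene: str
    position: int
    reference_id: str          # e.g. an rs-style ID
    database: str              # where the match was obtained from
    build: int                 # build number of that database
    validated: bool
    publicly_genotyped: bool
    maf: Optional[float] = None
    allele_type: str = ""
    nearby_snps: list = field(default_factory=list)

    def summary(self) -> str:
        """One-line description of the match."""
        status = "validated" if self.validated else "unvalidated"
        return f"{self.reference_id} in {self.gene} at {self.position} ({status})"


# Entirely hypothetical match.
match = SnpMatch("GENE1", 1000, "rs0000", "dbSNP", 1, True, False, maf=0.2)
print(match.summary())  # rs0000 in GENE1 at 1000 (validated)
```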
NCBI documentation assumes you know the whys and whats -