UCSC Genome Browser | Encyclopedia MDPI

UCSC Genome Browser: History

Subjects: Computer Science, Software Engineering

Contributor: HandWiki

The UCSC Genome Browser is an online and downloadable genome browser hosted by the University of California, Santa Cruz (UCSC). It is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. The Browser is a graphical viewer optimized for fast interactive performance and is an open-source, web-based tool suite built on top of a MySQL database for rapid visualization, examination, and querying of the data at many levels. The Genome Browser Database, browsing tools, downloadable data files, and documentation can all be found on the UCSC Genome Bioinformatics website.

mysql
web-based tool
genome sequence data

1. History

The UCSC Genome Browser was initially built in 2000 by Jim Kent, then a graduate student, and David Haussler, professor of Computer Science (now Biomolecular Engineering) at the University of California, Santa Cruz, who still manage it. It began as a resource for the distribution of the initial fruits of the Human Genome Project. Funded by the Howard Hughes Medical Institute and the National Human Genome Research Institute (NHGRI, one of the US National Institutes of Health), the browser offered a graphical display of the first full-chromosome draft assembly of human genome sequence. Today the browser is used by geneticists, molecular biologists, and physicians, as well as students and teachers of evolution, for access to genomic information.

2. Genomes

UCSC Genomes. By UCSC - https://genome.ucsc.edu, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=68576433

In the years since its inception, the UCSC Browser has expanded to accommodate genome sequences of all vertebrate species and selected invertebrates for which high-coverage genomic sequences are available,^[1] now including 46 species. High coverage is necessary to allow overlap to guide the construction of larger contiguous regions. Genomic sequences with less coverage are included in multiple-alignment tracks on some browsers (more on these tracks below), but the fragmented nature of these assemblies makes them unsuitable for building full-featured browsers. The species hosted with full-featured genome browsers are shown in the table.

Genomes
great apes	human, baboon, bonobo, chimpanzee, gibbon, gorilla, orangutan
non-ape primates	bushbaby, marmoset, mouse lemur, rhesus macaque, squirrel monkey, tarsier, tree shrew
non-primate mammals	mouse, alpaca, armadillo, cat, Chinese hamster, cow, dog, dolphin, elephant, ferret, guinea pig, hedgehog, horse, kangaroo rat, manatee, Minke whale, naked mole-rat, opossum, panda, pig, pika, platypus, rabbit, rat, rock hyrax, sheep, shrew, sloth, squirrel, Tasmanian devil, tenrec, wallaby, white rhinoceros
non-mammal chordates	American alligator, Atlantic cod, budgerigar, chicken, coelacanth, elephant shark, Fugu, lamprey, lizard, medaka, medium ground finch, Nile tilapia, painted turtle, stickleback, Tetraodon, turkey, Xenopus tropicalis, zebra finch, zebrafish
invertebrates	Caenorhabditis spp. (5), Drosophila spp. (11), honey bee, lancelet, mosquito, P. pacificus, sea hare, sea squirt, sea urchin, yeast
viruses	Ebola, SARS-CoV-2 coronavirus
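
The high-coverage criterion used to select these genomes is defined in the references as 6x coverage, meaning the total sequence obtained is six times the size of the genome. A minimal sketch of that arithmetic follows; the function name and the example numbers are hypothetical and chosen only for illustration.

```python
def fold_coverage(total_sequenced_bases: int, genome_size_bases: int) -> float:
    """Return fold coverage: total sequenced bases divided by genome size."""
    if genome_size_bases <= 0:
        raise ValueError("genome size must be positive")
    return total_sequenced_bases / genome_size_bases


# Hypothetical numbers: 18 billion sequenced bases for a 3-billion-base genome.
coverage = fold_coverage(18_000_000_000, 3_000_000_000)
is_high_coverage = coverage >= 6.0  # threshold taken from the 6x definition
print(f"{coverage:.1f}x coverage, high coverage: {is_high_coverage}")
```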

With assembly hubs users can load unique assemblies. An example can be seen in the Vertebrate Genomes Project assembly hub.

3. Browser Functionality

The large amount of data about biological systems that is accumulating in the literature makes it necessary to collect and digest information using the tools of bioinformatics. The UCSC Genome Browser presents a diverse collection of annotation datasets (known as "tracks" and presented graphically), including mRNA alignments, mappings of DNA repeat elements, gene predictions, gene-expression data, disease-association data (representing the relationships of genes to diseases), and mappings of commercially available gene chips (e.g., Illumina and Agilent). The basic paradigm of display is to show the genome sequence in the horizontal dimension and to show graphical representations of the locations of the mRNAs, gene predictions, and other annotations along it. Blocks of color along the coordinate axis show the locations of the alignments of the various data types. The ability to show this large variety of data types on a single coordinate axis makes the browser a handy tool for the vertical integration of the data.

To find a specific gene or genomic region, the user may type in the gene name, a DNA sequence, an accession number for an RNA, the name of a genomic cytological band (e.g., 20p13 for band 13 on the short arm of chr20) or a chromosomal position (chr17:38,450,000-38,531,000 for the region around the gene BRCA1).
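
As an illustration of the chromosomal position format, the following sketch parses a string such as the BRCA1 example into its parts. The `parse_position` helper and the `Region` type are hypothetical names introduced here and are not part of any UCSC tool; the sketch assumes that typed positions are 1-based and inclusive at both ends.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


# Matches "chrN:start-end", allowing commas as thousands separators.
_POSITION_RE = re.compile(r"^\s*(chr[\w.]+)\s*:\s*([\d,]+)\s*-\s*([\d,]+)\s*$")


def parse_position(text: str) -> Region:
    """Parse a position string such as 'chr17:38,450,000-38,531,000'."""
    match = _POSITION_RE.match(text)
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {text!r}")
    chrom, start_text, end_text = match.groups()
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinate range in {text!r}")
    return Region(chrom, start, end)


region = parse_position("chr17:38,450,000-38,531,000")
span = region.end - region.start + 1  # number of bases in the window
print(region, span)
```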

Presenting the data in graphical form allows the browser to provide links to detailed information about any of the annotations. The gene details page of the UCSC Genes track provides a large number of links to more specific information about the gene at many other data resources, such as Online Mendelian Inheritance in Man (OMIM) and SwissProt.

Designed for the presentation of complex and voluminous data, the UCSC Browser is optimized for speed. By pre-aligning the 55 million RNAs of GenBank to each of the 81 genome assemblies (many of the 46 species have more than one assembly), the browser allows instant access to the alignments of any RNA to any of the hosted species.

Multiple gene products of FOXP2 gene (top) and evolutionary conservation shown in multiple alignment (bottom). Original uploader: Dnaphd at English Wikipedia; transferred from en.wikipedia to Commons by IngerAlHaosului using CommonsHelper. BSD, https://commons.wikimedia.org/w/index.php?curid=8002512

The juxtaposition of the many types of data allows researchers to display exactly the combination of data that will answer specific questions. A PDF/PostScript output function allows export of a camera-ready image for publication in academic journals.

One unique and useful feature that distinguishes the UCSC Browser from other genome browsers is the continuously variable nature of the display. Sequence of any size can be displayed, from a single DNA base up to the entire chromosome (human chr1 = 245 million bases, Mb) with full annotation tracks. Researchers can display a single gene, a single exon, or an entire chromosome band, showing dozens or hundreds of genes and any combination of the many annotations. A convenient drag-and-zoom feature allows the user to choose any region in the genome image and expand it to occupy the full screen.

Researchers may also use the browser to display their own data via the Custom Tracks tool. This feature allows users to upload a file of their own data and view the data in the context of the reference genome assembly. Users may also use the data hosted by UCSC, creating subsets of the data of their choosing with the Table Browser tool (such as only the SNPs that change the amino acid sequence of a protein) and display this specific subset of the data in the browser as a Custom Track.
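
To make the Custom Tracks workflow concrete, the sketch below writes a small annotation file in BED format, a common format for uploaded tracks. It assumes the usual BED convention of a 0-based start and an exclusive end, plus a leading `track` line giving a name and description. The helper name, the track name, and the feature coordinates are all hypothetical.

```python
from typing import Iterable, Tuple

# (chrom, start, end, name) with 1-based inclusive coordinates; values are made up.
Feature = Tuple[str, int, int, str]


def to_custom_track(features: Iterable[Feature], track_name: str, description: str) -> str:
    """Render features as custom-track text: one track line, then BED rows."""
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, start, end, name in features:
        # BED uses a 0-based start and an exclusive end, so only the start shifts.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{name}")
    return "\n".join(lines) + "\n"


features = [
    ("chr17", 38_450_000, 38_450_500, "exampleFeatureA"),
    ("chr17", 38_500_000, 38_501_000, "exampleFeatureB"),
]
track_text = to_custom_track(features, "myTrack", "Hypothetical example features")
print(track_text, end="")
```

The resulting text could be saved to a file and uploaded through the Custom Tracks tool; the coordinates must refer to the same assembly that is being viewed.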

Any browser view created by a user, including those containing Custom Tracks, may be shared with other users via the Saved Sessions tool.

3.1. Tracks

UCSC Genome Browser Tracks. By UCSC - https://genome.ucsc.edu, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=68576038

Below the displayed image of the UCSC Genome browser are nine categories of additional tracks that can be selected and displayed alongside the original data. These categories are Mapping and Sequencing, Genes and Gene Predictions, Phenotype and Literature, mRNA and EST, Expression, Regulation, Comparative Genomics, Variation, and Repeats.

Categories
Category	Description	Example tracks
Mapping and Sequencing	Allows control over the style of sequencing displayed.	Base Position, Alt Map, Gap
Genes and Gene Predictions	Selects which gene-prediction programs and which databases of known genes are displayed.	GENCODE v24, Geneid Genes, Pfam in UCSC Gene
Phenotype and Literature	Databases containing specific styles of phenotype data.	OMIM Alleles, Cancer Gene Expr Super-track
mRNA and EST	Access to mRNAs and ESTs for human specific searches or general all purpose searches.	Human ESTs, Other ESTs, Other mRNAs
Expression	Display unique expressions of predetermined sequences.	GTEx Gene, Affy U133
Regulation	Information relevant to regulation of transcriptions from different studies.	ENCODE Regulation Super-track Settings, ORegAnno
Comparative Genomics	Allows the comparison of the searched sequence with other groups of animals with sequenced genomes.	Conservation, Cons 7 Verts, Cons 30 Primates
Variation	Compares the searched sequence with known variations.	Common SNPs(150), All SNPs(146), Flagged SNPs(144)
Repeats	Allows tracking of different kinds of repeated sequences in the query.	RepeatMasker, Microsatellite, WM + SDust

3.1.1. Mapping and Sequencing

These tracks allow for user control over the display of genomic coordinates, sequences, and gaps. Researchers have the ability to select tracks which best represent their query to allow for more applicable data to be displayed depending on the type and depth of research being done. The mapping and sequencing tracks can also display a percentage based track to show a researcher if a particular genetic element is more prevalent in the specified area.

3.1.2. Genes and Gene Predictions

The gene and gene prediction tracks control the display of genes and their constituent parts. The different tracks allow the user to display gene models, protein-coding regions, and non-coding RNA, as well as other gene-related data. Numerous tracks are available, allowing researchers to quickly compare their query with pre-selected sets of genes and look for correlations between known sets of genes.

3.1.3. Phenotype and Literature

Phenotype and Literature tracks deal with phenotype directly linked with genes as well as genetic phenotype. These tracks are intended primarily for physicians and other professionals concerned with genetic disorders, for genetics researchers, and for advanced students in science and medicine. A researcher can also display a track that shows the genomic positions of natural and artificial amino acid variants.

3.1.4. mRNA and EST

These tracks are related to expressed sequence tags (ESTs) and messenger RNA (mRNA). ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. The mRNA tracks allow the display of mRNA alignment data in humans as well as other species. There are also tracks allowing comparison with regions of ESTs that show signs of splicing when aligned with the genome.

3.1.5. Expression

Expression tracks are used to relate genetic data to the tissues in which it is expressed. This allows a researcher to discover whether a particular gene or sequence is linked with various tissues throughout the body. The expression tracks also allow for displays of consensus data about the tissues that express the query region.

3.1.6. Regulation

The regulation tracks of the UCSC Genome browser are a category of tracks that control the representation of promoter and control regions within the genome. A researcher can adjust the regulation tracks to add a display graph to the genome browser. These displays allow for more detail about regulatory regions, transcription factor binding sites, RNA binding sites, regulatory variants, haplotypes, and other regulatory elements.

3.1.7. Comparative Genomics

The UCSC Genome Browser allows the user to display different kinds of conservation data. The user can select from different tracks including primates, vertebrates, mammals among others, and see how the gene sequence they searched is conserved amongst other species. The comparative alignments give a graphical view of the evolutionary relationships among species. This makes it a useful tool both for the researcher, who can visualize regions of conservation among a group of species and make predictions about functional elements in unknown DNA regions, and in the classroom as a tool to illustrate one of the most compelling arguments for the evolution of species. The 44-way comparative track on the human assembly clearly shows that the farther one goes back in evolutionary time, the less sequence homology remains, but functionally important regions of the genome (e.g., exons and control elements, but not introns typically) are conserved much farther back in evolutionary time.

3.1.8. Variation Data

Many types of variation data are also displayed. For example, the entire contents of each release of the dbSNP database from NCBI are mapped to human, mouse and other genomes. This includes the fruits of the 1000 Genomes Project, as soon as they are released in dbSNP. Other types of variation data include copy-number variation data (CNV) and human population allele frequencies from the HapMap project.

3.1.9. Repeats

The repeat tracks of the genome browser allow the user to see a visual representation of the DNA areas with low complexity repetitions. Being able to visualize repetitions in a sequence allows for quick inferences about a search query in the genome browser. A researcher has the potential to quickly see that their specified search contains large amounts of repeated sequences at a glance and adjust their search or track displays accordingly.

4. Analysis Tools

The UCSC site hosts a set of genome analysis tools, including a full-featured graphical interface for mining the information in the browser database and BLAT,^[2] a fast sequence alignment tool that is also useful for simply finding sequences within the massive sequence (human genome = 3.23 billion bases [Gb]) of any of the featured genomes.

A liftOver tool uses whole-genome alignments to allow conversion of sequences from one assembly to another or between species. The Genome Graphs tool allows users to view all chromosomes at once and display the results of genome-wide association studies (GWAS). The Gene Sorter displays genes grouped by parameters not linked to genome location, such as expression pattern in tissues.
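
The idea behind converting coordinates between assemblies can be sketched with a toy model in which the alignment is reduced to a list of gap-free aligned blocks. This is not the liftOver implementation: it ignores strand, chromosome names, and interval splitting, and the `AlignedBlock` type, the `lift_position` function, and the block coordinates are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class AlignedBlock(NamedTuple):
    old_start: int  # 0-based start of the block in the old assembly
    new_start: int  # 0-based start of the same block in the new assembly
    length: int     # number of gap-free aligned bases


def lift_position(blocks: List[AlignedBlock], old_pos: int) -> Optional[int]:
    """Map a position from the old assembly to the new one, or return None."""
    for block in blocks:
        offset = old_pos - block.old_start
        if 0 <= offset < block.length:
            return block.new_start + offset
    return None  # position falls in a gap with no aligned counterpart


# Made-up alignment: two blocks separated by an unaligned gap in the old assembly.
blocks = [
    AlignedBlock(old_start=100, new_start=150, length=50),
    AlignedBlock(old_start=160, new_start=200, length=40),
]
lifted = lift_position(blocks, 120)    # inside the first block
unmapped = lift_position(blocks, 155)  # between the two blocks
print(lifted, unmapped)
```

Positions that fall outside every aligned block have no counterpart in the target assembly, which is why a conversion of this kind can fail for some regions.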

5. Open Source / Mirrors

The UCSC Browser code base is open-source for non-commercial use, and is mirrored locally by many research groups, allowing private display of data in the context of the public data. The UCSC Browser is mirrored at several locations worldwide, as shown in the table.

official mirror sites
European mirror — maintained by UCSC at University of Bielefeld, Germany
Asian mirror — maintained by UCSC at RIKEN, Yokohama, Japan

The Browser code is also used in separate installations by the UCSC Malaria Genome Browser and the Archaea Browser.

The content is sourced from: https://handwiki.org/wiki/Biology:UCSC_Genome_Browser

References

[1] "High-coverage" here means 6x coverage, or six times more total sequence than the size of the genome.
[2] Kent, WJ. (Apr 2002). "BLAT - the BLAST-like alignment tool". Genome Res 12 (4): 656–64. doi:10.1101/gr.229202. PMID 11932250. PMC 187518. http://genome.cshlp.org/content/12/4/656.abstract.

© Text is available under the terms and conditions of the Creative Commons Attribution-ShareAlike (CC BY-SA) license; additional terms may apply.