NCBI NOW, Lecture 1, NCBI Resources for NGS

My name is Ben Busby, and I am the Genomics Outreach
Coordinator for NCBI. Welcome to the 2015
NCBI Next Generation Sequencing Online Workshop. Today I am going
to present a set of slides which introduces genomics
resources at NCBI. At the end of that presentation
I will also be presenting some community resources
for genomics, some thoughts on communicating
with bioinformaticians, an overview of the other five to
six lectures in this workshop, how to get a hands-on tutorial
for RNAseq mapping, as well as the logistics
of this course and where to ask questions. So without further ado,
let’s begin. What I’m going to talk
about today is terminology, NCBI health resources, BioProject, BLASTing into SRA, what dbGaP is, how to get
our reference genomes, as well as how
to get large datasets, and where to get
more information about NCBI resources. First, I’d like
to do a basic overview of next-generation sequencing, particularly
in the DNAseq space. So in the DNAseq space one
would, from a sequencer, obtain FASTQ files. FASTQ files should always
be trimmed for quality, and we will go
over that in the second lecture of this workshop series. Once these FASTQ files are
trimmed for quality a reference genome sequence
should be obtained (I’m going to tell you
about that in about 30 minutes), and then these reads should be
mapped to a reference genome. That we will go over in
lecture four of this workshop. Finally, variants can be
called from that mapping. We’ll go over that at the end of
lecture four in this workshop. A more advanced technique,
genome assembly, will not be covered
in this workshop, but for completeness, it is outlined in the right
column of this diagram. Before we launch into
NCBI resources, I would like to give an overview of terms used
in next-generation sequencing. For that I’m going
to go to this bitly, which you can use as well. When we go
to this NCBI genomics glossary we can see a stylized version
of the diagram we saw before. This highlights
FASTQ sequence reads, mapping to reference genomes, and then making variant calls
and annotating variants. On the other side
of things is a simple de novo assembly pipeline,
as I mentioned earlier. Perhaps the most
important file type in all of next-generation
sequencing is a BAM file. These are binary files containing reads aligned
to a reference genome. It is important to remember
that the human-readable analogs of these files are
called SAM files; however, the reason we typically
keep those files in binary is that they’re about 16%
of the size in this format. Coverage refers
to how many reads cover the average point in a genome. A typical read coverage
for a human genome might be in the range
of 4 to 30 X. A FASTA file is the most
common sequence format used for biological sequences. It can have a variety
of file extensions. FASTQ files are the most
common unaligned data format for next-generation sequencing. The Genome Reference Consortium
is a consortium of NCBI, the Sanger Institute, the Genome Institute
of Washington University in St. Louis, and EBI. This is a group responsible for the improvement
of the human, mouse, and zebrafish reference
genome assemblies. If you’ve ever wondered what
the GRC in GRCh or GRCm means, this is what it is. GFF and GTF files are standard nine-column
tab-delimited formats for representing
annotated features and alignments
in genome sequence coordinates, and are marked
by those file extensions. Indexes are a set
of files created when a reference genome is
prepared for mapping. And mapping,
as we discussed before, is aligning short reads
to reference genomes. We’ve discussed reference
genomes a few times. SAM files are the human-readable
analogs of BAM files. In other words, these are reads from FASTQ files mapped
to reference genomes and represented
in a standard file format, which we will discuss
at length later in the course. NCBI databases that provide
these data types to users will be covered in detail
during this presentation. However, for a quick
reference you can go back to this glossary
and look them up. For completeness, we’ve also included
some de novo sequencing terms; however, those terms are
beyond the scope of this course, so we will skip them for today. Downstream analysis file formats
are also quite important. The most important
downstream analysis format, in my opinion, is the VCF. This is a tab-delimited
file format for communicating variant data
and related metadata. Now we will go back
to the presentation. Now we’re going to talk about
some very popular NCBI resources for obtaining genomic data. Say I were to search for tuberculosis on the NCBI home page
or Entrez, I would get several million hits across a whole bunch
of NCBI databases. I want to show you
a few specific ones and then take a general approach to obtaining genomic
data from NCBI. The first database I’d like
to show you is ClinVar. ClinVar is a database
of sequence variants asserted to be associated
with phenotypic variants by particular submitters. As you can see, we have some genetic variants
linked to susceptibility to tuberculosis. We can take a look
at some of those variants and we can see that they are
related to susceptibility. Another thing one can do
at the ClinVar website is look at the evidence coming
from different submitters. In this particular case,
two submitters have claimed that a particular
variant is pathogenic and two have claimed
it’s protective. Now that could be true given
the particular environmental and bacterial context
that underlie these cases or lack thereof of tuberculosis. We can also look at genes
at the NCBI Variation Viewer. This gives us a view
of specific isoforms of genes we may
be interested in, ClinVar variants,
dbSNP variants, and also structural
variants from dbVar. We can search Variation Viewer
by location, gene names, or phenotypes depending
on what we’re interested in. Many bioinformatics
students ask me whether there is a one-stop shop for molecular
information in bioinformatics. Even before I worked at NCBI, my answer would be NCBI Gene. The reason for that is
for each of these gene entities there is a wealth of links
to other resources, both at NCBI, as well
as outside of NCBI. One of the most exciting things about the NCBI Gene pages is
the sequence viewer. By clicking on “Tracks”
one can expand out on the aggregate exon common
coverage across 24 human tissues to look at different isoforms
in different cell types, as well as to upload data
from a particular computer or from the Internet. Once again,
those functions are accomplished by clicking
on the Tracks button. Sequences pertaining
to particular isoforms can also be downloaded
by clicking on the “Tools” menu or by simply right clicking
on the isoform of interest. I’d like to show you
one more NCBI resource before we jump
into analyzing genomic data, and that NCBI resource is GTR. GTR is the Genetic
Testing Registry at NCBI. This is a resource that can
be used by clinicians to order genetic tests
for their patients. Many of these tests are provided
in an easy-to-access format by commercial testing companies. Now I’d like to show you
how to jump into the petabases of genomic data housed at NCBI. The first tool I like to use for delving into
NCBI genomic data is BioProject. BioProject will show
a lot of information on a subject such
as tuberculosis, but it’s also facetable, meaning that one can look
at particular data types. For example, I could
use these facets to look at RNAseq data specifically by clicking
on “Transcriptome” and SRA. SRA is where our raw sequencing reads from next-generation
sequencing are kept and transcriptome implies
that this is RNA-based data. If I do that, I would
get 34 results. However, it’s important
to note that this slide was originally
made several months ago, and the number of
results has grown significantly. At this point I would challenge
you to go to BioProject, search for tuberculosis,
filter for RNAseq, and see how many results
you get today. If we dig into a particular
BioProject, this particular one
for RNAseq of human melanoma, we can see that we can get
a variety of data types: SRA experiments,
publications in PubMed, as well as sample information and GEO datasets processed
by the particular user. If I click on the number of SRA
runs on a particular BioProject, this one being clinical isolates
of Mycobacterium tuberculosis, I get to the SRA database,
and I can see all of those runs. I can go into individual runs and look at technical
metadata related to those runs. I can go further
into the browser and look at even more metadata. Some runs in SRA are aligned. If they are aligned,
then an alignment tab will appear and I can look
at them in the sequence viewer. This is the Homo sapiens APOE gene, and I can clearly see a het
in one of the introns. Going back to RNAseq data
in tuberculosis, if I look at RNAseq data for
multifunctional memory T cells I can go to the SRA page
from BioProject, as I showed you before, and I can also click
on a particular experiment. This experiment will
show me the number of runs, and I can also click
on the run selector and get a variety of metadata
about the particular experiment. Currently, many studies
in the run selector in SRA have quite
a bit of metadata. Why would I want to use SRA? Because I can slice a particular
region out of the SRA, and as we’re going
to show you later, we’ve enabled two popular tools, GATK, as well as HISAT, to work with DNAseq data and
RNAseq data in SRA respectively. However, I would just like to mention that, using
the SRA Toolkit directly, I can slice out
a particular aligned region. Why would I want to do this? Because I can slice out
a particular region of breast cancer cell lines, look for some particular
mutants, and compare that with
the 1000 Genomes Project, which includes exome sequencing
of about 2,600 phenotypically normal humans. The SRA Toolkit has a lot
of options, and those are easy to read
in the help documents provided. Also available
on our GitHub site is a functional
read collection API, and from the GitHub site
you can get all kinds of neat examples, like going
from SRA data into Spark. Also, the SRA Toolkit is
available in the Tool Shed in Galaxy, which we will show you in the
next lecture of this workshop. However, what I’d like
to show you now is a way for anyone to get into SRA data, whether they have prior
experience with genomics or not, and that’s by using
the popular NCBI tool, BLAST. In in-person workshops when we
ask how many people interested in genomics have used BLAST
previously, we get a response of 96
or 97% of attendees. So if I search BioProject
for something and find, for example, some mice that have been
infected with influenza A, I can go to the BioProject and once again go
to a number of SRA experiments. And by selecting them and
clicking on the “Send-to” menu, I can send them to BLAST. I then get my normal BLAST page and I can select
an accession number or a FASTA sequence
to BLAST into these accessions. My first thought when constructing this demo was
to BLAST interferon gamma into these SRA accessions, as interferon gamma is commonly
upregulated during viral infection. So I can go to the nucleotide
database, select mRNA, select a FASTA sequence, and get an mRNA sequence
for interferon gamma. I would suggest
when BLASTing into SRA to cut off the poly A tail, for reasons that should
be somewhat obvious. If they are not obvious, I would highly recommend
that you try it both ways. So I can paste my sequence
into the normal BLAST webpage and run BLAST. Unfortunately, in this example I do not get very many
significant hits. At first this was
disheartening to me. However, with a quick look
at the literature we can see that influenza A does
not stimulate the production of interferon gamma. If we find an mRNA that interferon gamma does
stimulate the production of, we can BLAST it
into the same sets and come up with a lot of hits. Additionally, if you wanted to
load your own data into SRA for the purpose of being able
to BLAST into it, you could use the LATF loader
shown in this github repository. There are a bunch
of other cool tools in some other
github repositories, like the one shown here. Another new BLAST tool
that may be of interest to those interested
in genomics is MOLE-BLAST, which will
automatically make OTUs from well-conserved
sequences in bacteria, fungi, or even viruses. If you’re interested,
for more information, please go
to the MOLE-BLAST website and click on the “Help” tab. Now I’d like to switch gears and talk a little bit
about dbGaP. dbGaP is the NCBI database
of genotypes and phenotypes. This is where dbGaP
fits into the NCBI pantheon. Currently dbGaP has over 1.2
million samples from over a million individuals. Investigators can get to dbGaP
data through authorized access, as it is protected by U.S. law. dbGaP also has a number
of analysis sets, such as these P values
from a GWAS study. There is a lot
of different data in dbGaP, and this diagram shows
different types of data and the relative
amounts we have. Lately there has definitely
been a growth in the whole genome portion
for genomics sequence data. If one has access to dbGaP one will soon be able to search
by minimum read coverage and by minimum allele percentage
across dbGaP studies, as shown here. If one has approved access to particular studies one
should be able to go in and look at data
for particular individuals. We are also putting
aggregate dbGaP data in ClinVar. In a ClinVar page, much like the one
I showed you earlier, we can now see variants
in dbGaP, those called by
dbGaP submitters, as well as the ones we see
in aligned sequence data. There are also background sets
for short nucleotide variation derived from 1000
Genomes and GO-ESP. We are working to expand
those background sets into larger areas of dbGaP, as you saw with
the recent ClinVar example. I’d like to talk a little bit now about how to get reference
genomes from NCBI. First I’d like to introduce
the new human genome assembly, GRCh38. GRCh38 is made by
the Genome Reference Consortium that I mentioned before. Also, one can
get reference genomes for many bacterial species, at this point about 37,000. One
can look at the Genome website for relationships
between these species, and if one wants to get sequence
for these particular genomes, one can go to
the NCBI Assembly website. By going to the Assembly website and clicking
on a particular strain, one can see all
the examples of those strains that we have and go
to particular examples and to their corresponding FTP
sites to get the FASTA files, annotation files, and also protein FASTA files. As of July 13th, 2015, we had
over 38,000 RefSeq assemblies. 4,200 of these, roughly,
were complete genomes, with the rest being
scaffolds or contigs. Some changes we’ve made lately
are that for proteins with identical sequence we
have combined their accessions into what we now know as WPs. However, there are still
individual annotations on reference genomes. What are reference genomes? Reference genomes
are genomes of high-quality sequence and annotation
that we have identified and that are annotated
by the NCBI PGAP pipeline. We also have representative
genomes in taxa without reference genomes, as well as variant genomes that are typically
represented on the scaffold or contig assembly level
and do not link to NCBI Gene. The NCBI Prokaryotic
Genome Annotation Pipeline is diagrammed here. I’m going to spend the next few
minutes talking about access to large datasets, and I’ll be talking specifically about how
to access genomic variants, as well as use
the Eutils facilities to cross between NCBI databases. One example of a large dataset I use frequently
is the aggregated data from the ClinVar resource
in VCF format. This can be easily localized on
either the GRCh37 or the GRCh38 reference
at the ClinVar FTP site. Another very useful VCF or XML that can be accessed
by FTP is that of dbSNP. PheGenI, the Phenotype-Genotype Integrator, is a resource where one
can search by phenotype, location, gene, or SNP
for significant relationships between phenotype and genotype. The last NCBI topic I’ll
address today is Eutils. Eutils are a way
to cross between NCBI databases and a specific
video is available at the NCBI webinars page, which is easily accessible
by Googling NCBI webinars. NCBI Eutils allow you to cross
between many NCBI databases. Here is a cloud
of the NCBI E-utilities, and these are some examples
of some of the things that these allow you to do. We’ve also introduced EDirect, which provides E-utilities
on the Linux command line. We also have a dbGaP data
browser that will be out soon for those interested
in visual access to dbGaP data. More detailed information
about NCBI resources can be found
at our webinars page. Here I have provided a link to
some other community resources for genomics that are not
available at NCBI. This is a very large list, but you or your colleagues may know of other tools
that are not present. This is provided on github
as a place where people are able to add tools by simply
doing pull requests. A couple of quick
pieces of advice about communicating
with bioinformaticians: please come up with specific
questions and workflows. Do have a specific question. Don’t ask someone to find
something cool in the data. I want to give
a quick overview of each lecture
in this workshop. This lecture will be
followed by a lecture about the FASTQ file format
and some quality checks that can be done
with the Galaxy program. It will also provide
a brief overview of what the Galaxy framework
for genomics is. That will be followed
by a brief overview of Linux command line basics. Almost every community program for genomics is written
in the Linux framework. If you understand the basics
of the Linux command line then you should be able
to use some of these tools. Also, there will be portions
of the rest of the course which are taught
in the Linux command line. For example,
the following two lectures on DNAseq and Variant Calling, as well as RNAseq Mapping
and Read Counts, which will be presented
by Jonathan Pevsner. Finally, Tom Madden, the head
of the NCBI BLAST group, will present a lecture
about how to leverage new BLAST capabilities
for genomics research. At the end of the lectures
you will be able to use a hands-on
RNAseq mapping tutorial. I’d like to thank
the following people for putting together
this short workshop of lectures, and, finally, the advantage
of taking the workshop between October 13th
and October 23rd is that you can ask questions and we will
respond to them live. What you will do to ask
questions is to go to Biostars, found as, and tag your questions
with the “NCBI now” tag. Thank you very much,
and have a great day.
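
The glossary part of the talk defines coverage as how many reads cover the average point in a genome. As a rough illustration, that average depth can be estimated as total sequenced bases divided by genome length. The sketch below assumes fixed-length reads, and the run size, read length, and genome size are hypothetical numbers chosen for the example, not figures from the lecture.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Estimate average depth as (number of reads * read length) / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical run: 600 million reads of 150 bases against a ~3.1 Gb genome.
depth = mean_coverage(600_000_000, 150, 3_100_000_000)
print(f"{depth:.1f}x")
```

With these made-up inputs the estimate lands near the top of the 4 to 30 X range mentioned in the talk. Real coverage is uneven across a genome, so this is only a planning figure.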

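The talk describes VCF as a tab-delimited format for communicating variant data and related metadata. A minimal sketch of reading one data line might look like the following. It assumes the standard eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the example line, its identifier, and its INFO keys are invented for illustration and are not taken from any NCBI file.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    id: str
    ref: str
    alts: list
    info: dict


def parse_vcf_line(line: str) -> Variant:
    """Split one VCF data line (not a '#' header line) into its main fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        # INFO entries are either KEY=VALUE pairs or bare flags.
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True
    return Variant(chrom, int(pos), vid, ref, alt.split(","), info_dict)


# Hypothetical data line, for illustration only.
example = "1\t12345\trsHYPOTHETICAL\tA\tG,T\t.\t.\tGENEINFO=GENE1:1;FLAG"
v = parse_vcf_line(example)
print(v.chrom, v.pos, v.ref, v.alts, v.info)
```

For real work, an established VCF parsing library is a safer choice than hand-rolled splitting, since it also handles headers, genotype columns, and escaping.
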

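The talk challenges viewers to search BioProject for tuberculosis and mentions that E-utilities let you work across NCBI databases programmatically. As a hedged sketch, the same search could be expressed as an E-utilities `esearch` request. The code only builds the request URL and does not contact NCBI; the helper name `esearch_url` and the choice of `retmax` and JSON output are assumptions made for this example.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL for one Entrez database and one query term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


url = esearch_url("bioproject", "tuberculosis")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching record identifiers and a total count. NCBI asks heavy users to rate-limit their requests, so scripted use should stay modest.
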
  1. Giulia Pregno said:

    Hi, are the slides of this lecture (and the following ones) available?

    February 9, 2016
  2. Nutan Thakur said:

    I wish to submit 16S rRNA Illumina MiSeq metagenomics data to NCBI. Which kind of tools and NCBI subset will be suitable for this purpose?

    June 17, 2016
