Practical 1

Discussion

1

Features of major databases

(PubMed and NCBI Protein Db)

2

Anatomy of PubMed Db

3

Epub ahead of print and journal impact factor

Adapted from: http://admin-apps.isiknowledge.com/JCR/JCR?RQ=LIST_SUMMARY_JOURNAL

How to get impact factor of any journal:

1) Direct source – Web of Science database

2) Indirect sources, e.g. blogs and websites (do a Google search)

4

Anatomy of a PubMed record

5

Demo on downloading articles

6

Anatomy of a Protein Db

7

Accession numbers and GenInfo Identifiers

Format: gi|numeric identifier|source|alphanumeric identifier

Example (human P53 RefSeq mRNA record): gi|120407067|ref|NM_000546.3

GI (or GenInfo Identifier): 120407067

Source: ref (RefSeq database)

Accession: NM_000546

Version: NM_000546.3

Other popular sources:

dbj – DDBJ (DNA Data Bank of Japan database)

emb – The European Molecular Biology Laboratory (EMBL) database

prf – Protein Research Foundation database

sp – SwissProt

gb – GenBank

pir – Protein Information Resource
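
The identifier layout above can be taken apart with a few lines of Python. This is a minimal sketch, assuming the four-field gi|GI|source|accession.version layout of the example; the names `SOURCE_TAGS`, `SeqId` and `parse_gi_identifier` are made up for illustration and are not part of any NCBI tool. Identifiers from some sources may use a different field layout, which this sketch does not handle.

```python
from typing import NamedTuple

# Source tags listed on the slide (illustrative lookup table, not exhaustive).
SOURCE_TAGS = {
    "ref": "RefSeq",
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SwissProt",
    "pir": "Protein Information Resource",
    "prf": "Protein Research Foundation",
}


class SeqId(NamedTuple):
    gi: int
    source: str
    accession: str
    version: int


def parse_gi_identifier(identifier: str) -> SeqId:
    """Split 'gi|<GI>|<source>|<accession.version>' into its parts."""
    fields = identifier.strip().lstrip(">").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"not a gi-style identifier: {identifier!r}")
    accession, _, version = fields[3].partition(".")
    if not (fields[1].isdigit() and version.isdigit()):
        raise ValueError(f"GI and version must be numeric: {identifier!r}")
    return SeqId(int(fields[1]), fields[2], accession, int(version))


record = parse_gi_identifier("gi|120407067|ref|NM_000546.3")
print(record.gi, record.accession, record.version)
print(SOURCE_TAGS.get(record.source, "unknown source"))
```

Keeping the accession and the version as separate fields makes it easy to compare two records of the same accession later on.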

8

Why do we need accession number and GI for one record?

1) What is the difference between accession and GI?

2) Why do we need these two when both seem to be accession numbers?

9

Why do we need accession number and GI for one record?

ACCESSION    GI           VERSION
NM_000546    120407067    NM_000546.3
NM_000546    8400737      NM_000546.2
NM_000546    4507636      NM_000546.1

Sequence_v1 (Version NM_000546.1, GI 4507636)

→ Sequence update → Sequence_v2 (Version NM_000546.2, GI 8400737)

→ Sequence update → Sequence_v3 (Version NM_000546.3, GI 120407067)

Q1) Which revision will NCBI show if you were to search by the accession only without the version number?

10

Accession numbers

- The unique identifier for a sequence record.

- An accession number applies to the complete record.

- Accession numbers do not change, even if information in the record is changed at the author's request.

- Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.

11

GenInfo Identifiers

- GenInfo Identifier: sequence identification number

- If a sequence changes in any way, a new GI number will be assigned

- A separate GI number is also assigned to each protein translation within a nucleotide sequence record

- A new GI is assigned if the protein translation changes in any way

- GI sequence identifiers run parallel to the new accession.version system of sequence identifiers

12

Version

- A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.

- If there is any change to the sequence data (even a single base), the version number will be increased, e.g., U12345.1 → U12345.2, but the accession portion will remain stable.

- The accession.version system of sequence identifiers runs parallel to the GI number system, i.e., when any change is made to a sequence, it receives a new GI number AND an increase to its version number.

- A Sequence Revision History tool (http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi) is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record.
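
The parallel GI and accession.version systems can be pictured as a small lookup table. The sketch below is a toy model only, not how NCBI stores or serves records: the dictionary holds the three NM_000546 revisions shown earlier, and the function names are hypothetical. It assumes that a lookup by bare accession should resolve to the highest version number.

```python
# accession.version -> GI, copied from the NM_000546 example in these slides.
REVISIONS = {
    "NM_000546.1": 4507636,
    "NM_000546.2": 8400737,
    "NM_000546.3": 120407067,
}


def split_accession_version(identifier):
    """'NM_000546.3' -> ('NM_000546', 3)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version)


def latest_revision(accession, revisions):
    """Return (accession.version, GI) for the highest version of an accession."""
    candidates = []
    for identifier, gi in revisions.items():
        acc, version = split_accession_version(identifier)
        if acc == accession:
            candidates.append((version, identifier, gi))
    if not candidates:
        raise KeyError(accession)
    _, identifier, gi = max(candidates)  # compare by version number first
    return identifier, gi


print(latest_revision("NM_000546", REVISIONS))
```

Each sequence update adds one entry with a new GI and a version number one higher, while the accession part stays the same.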

13

Anatomy of a Protein Db record

14

Fasta Sequence

15

Fasta Format

• Text-based format for representing nucleic acid sequences or peptide sequences (single letter codes).

• Easy to manipulate, and easy for programs to parse.

Each record consists of a description line/row followed by sequence data line(s):

>SEQUENCE_1

MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG

LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK

IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL

MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL

>SEQUENCE_2

SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI

ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Fasta Format (cont.)

Begins with a single-line description, followed by lines of sequence data.

• Description line

– Distinguished from the sequence data by a greater-than (">") symbol.

– The word following the ">" symbol in the same row is the identifier of the sequence.

– There should be no space between the ">" and the first letter of the identifier.

– Keep the identifier short and clear; some old programs accept identifiers of at most 10 characters. For example: >gi|5524211|Human or >HumanP53

• Sequence line(s)

– Ensure that the sequence data starts in the row following the description row (be careful of word wrap feature)

– The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
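
The rules above translate almost directly into a small reader. This is a minimal sketch for illustration, not a replacement for an established parser; the function names are made up. It assumes the identifier is the first word after ">" and treats the rest of the description line as free text.

```python
def _make_record(header, chunks):
    # The identifier is the first word after ">"; the remainder is description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between rows
        if line.startswith(">"):
            # A new ">" line ends the previous sequence and starts another.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before any '>' description line")
        else:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""

for identifier, description, sequence in parse_fasta(example):
    print(identifier, len(sequence))
```

With this reader, a description line written as "> gi|5524211|Human" (with a space after ">") yields an empty identifier, which is one practical reason to avoid that space.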

Amino acids

18

IUPAC One Letter Amino Acid Code

Code  Amino acid             Mnemonic or note
A     Alanine
B     Asx                    Asparagine or aspartic acid
C     Cysteine
D     Aspartic acid          Aspar(D)ic acid
E     Glutamic acid
F     Phenylalanine          (F)enylalanine
G     Glycine
H     Histidine
I     Isoleucine
J     Xle                    Leucine or isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine             Asparagi(N)e
O     Pyrrolysine (Pyl)      22nd amino acid, Pyrr(O)lysine
P     Proline
Q     Glutamine              (Q)lutamine
R     Arginine               (R)ginine
S     Serine
T     Threonine
U     Selenocysteine (Sec)   21st amino acid
V     Valine
W     Tryptophan             T(W)ptophan
X     Xaa                    Unspecified or unknown amino acid
Y     Tyrosine               T(Y)rosine
Z     Glx                    Glutamine or glutamic acid

Note

Amino acid                          Three letter code   Single letter code
Asparagine or aspartic acid         Asx                 B
Glutamine or glutamic acid          Glx                 Z
Leucine or isoleucine               Xle                 J
Unspecified or unknown amino acid   Xaa                 X
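
The one-letter codes and the ambiguity codes in the note lend themselves to a quick sanity check of a protein sequence. The following is a rough sketch under the assumption that only the letters listed in these slides are acceptable; the names `ONE_LETTER`, `AMBIGUOUS` and `classify` are hypothetical.

```python
# Letters that name exactly one amino acid (20 standard plus U and O).
ONE_LETTER = {
    "A": "Alanine", "C": "Cysteine", "D": "Aspartic acid", "E": "Glutamic acid",
    "F": "Phenylalanine", "G": "Glycine", "H": "Histidine", "I": "Isoleucine",
    "K": "Lysine", "L": "Leucine", "M": "Methionine", "N": "Asparagine",
    "O": "Pyrrolysine", "P": "Proline", "Q": "Glutamine", "R": "Arginine",
    "S": "Serine", "T": "Threonine", "U": "Selenocysteine", "V": "Valine",
    "W": "Tryptophan", "Y": "Tyrosine",
}

# Ambiguity codes from the note: each stands for one of two amino acids.
AMBIGUOUS = {"B": "DN", "Z": "EQ", "J": "IL"}


def classify(sequence):
    """Count specific, ambiguous, unknown (X) and invalid letters."""
    counts = {"specific": 0, "ambiguous": 0, "unknown": 0, "invalid": 0}
    for letter in sequence.upper():
        if letter in ONE_LETTER:
            counts["specific"] += 1
        elif letter in AMBIGUOUS:
            counts["ambiguous"] += 1
        elif letter == "X":
            counts["unknown"] += 1
        else:
            counts["invalid"] += 1
    return counts


print(classify("MTEITAAMVKBXZ*"))
```

A non-zero "invalid" count is a hint that the text is not a clean peptide sequence, for example because a description line or a stray symbol slipped into the sequence rows.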

Advice

• We highly recommend that you memorize the amino acid codes and their structures

• Memorizing the codes and in particular the structures will be very useful for this module and other modules, especially for research purposes.

• It is not compulsory that you memorize these for this module.

Features of major databases

(Gene Db)

23

Anatomy of Gene Db

24

Anatomy of a Gene Db record

25

A section of a Gene Db record: Reference Sequences (mRNA accession number and protein accession number)

26

Take home messages for databases

Bioinformatics = databases + tools

General databases versus specialized databases

Databases come and go (especially the small ones)

Database redundancy – many databases for the same topic (use the most comprehensive one; failing that, use all of them for full coverage)

Database accuracy – published ones are more reliable; nevertheless, they are still prone to errors; it is always good to spend some time assessing the reliability of your data of interest by cross-referencing to the literature or other databases

Fortunately, most databases are cross-referenced

Unfortunately, there is no common standard format; you need to spend some time familiarizing yourself with each one; this becomes easy after some practice

Finding databases relevant to you

– NAR Database catalogue

– PubMed

– Google

2 main methods for searching databases (each with its own pros and cons)

– 1. Keyword search (covered today)

– 2. Sequence search (day 2)

27
