K. SEKAR, Ph.D.
Dr. K. Sekar
Bioinformatics Centre
Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore 560 012
INDIA
E-mail: sekar@physics.iisc.ernet.in
Voice: +91-080-3601409 or +91-080-2932469
Fax : +91-080-3600683 or +91-080-3600551
APPROACHES TO DEVELOPING DATA MINING TOOLS
Abstract
Bioinformatics is one of the fastest-growing interdisciplinary areas in the biological sciences, and it has expanded to the point where we need powerful tools to organize and analyze the data. This overview presents the general features of data mining tools, techniques and their applications.
Bioinformatics is the fashionable new name for the field previously called computational biology. The name is preferred by many because it puts the emphasis on data storage and analysis rather than on the biology, and the field is really data driven.
The term bioinformatics is used to encompass almost all computer applications in the biological sciences, but it was originally coined in the mid-1980s for the analysis of biological sequence data.
The quantity of known sequence data outweighs protein structural data and, by virtue of the genome projects, sequence databases are doubling in size every year.
A key challenge of bioinformatics is to analyze the wealth of sequence data in order to understand the amassed information in terms of protein structure, function and evolution.
Wherever possible, a range of different methods should be used, and the results should be married with all available biological information.
Bioinformatics has provided us with a communication channel to reach and decode all this information in a comprehensive manner.
Both the large information repositories and the specialized tools to query them are held on distributed Internet sites; bioinformatics therefore requires sound Internet navigation skills.
The primary integrating technology that facilitates access to copious data is the World Wide Web.
Refers to database-like activities involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time
Encompasses the use of algorithmic tools to facilitate biological database analyses
Comprises the entire collection of information management systems, analysis tools and communication networks supporting biology
DATA MINING
Data mining is defined as “exploration and analysis, by automatic and semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules”.
The central challenge is to derive maximum results from the wealth of data. This can be achieved by establishing and maintaining databases and providing search and analysis tools to interpret the data.
DATABASE
A database is a collection of quantitative data resulting from experimental measurements or observations in various fields of science. Recently, interest in databases has been kindled through international efforts to organize and analyze the data and update the knowledge.
A database is essentially just a store of information, usually in the form of simple files (just a flat file, say). You can shove information into this store or retrieve it from the store.
Derived Database
One of the greatest challenges in database research is to analyze a database in depth and create derived databases that meet users' needs or demands without compromising the sustainability and quality of the existing database. Creating the desired database is expected to dramatically reduce the workload of the user community and to serve as a highly focused database.
DBREF 1UNE 1 123 SWS P00593 PA2_BOVIN 23 145
SEQADV 1UNE ASN 122 SWS P00593 LYS 144 CONFLICT
SEQRES 1 123 ALA LEU TRP GLN PHE ASN GLY MET ILE LYS CYS LYS ILE
SEQRES 2 123 PRO SER SER GLU PRO LEU LEU ASP PHE ASN ASN TYR GLY
SEQRES 3 123 CYS TYR CYS GLY LEU GLY GLY SER GLY THR PRO VAL ASP
SEQRES 4 123 ASP LEU ASP ARG CYS CYS GLN THR HIS ASP ASN CYS TYR
SEQRES 5 123 LYS GLN ALA LYS LYS LEU ASP SER CYS LYS VAL LEU VAL
SEQRES 6 123 ASP ASN PRO TYR THR ASN ASN TYR SER TYR SER CYS SER
SEQRES 7 123 ASN ASN GLU ILE THR CYS SER SER GLU ASN ASN ALA CYS
SEQRES 8 123 GLU ALA PHE ILE CYS ASN CYS ASP ARG ASN ALA ALA ILE
SEQRES 9 123 CYS PHE SER LYS VAL PRO TYR ASN LYS GLU HIS LYS ASN
SEQRES 10 123 LEU ASP LYS LYS ASN CYS
HET CA 124 1
HETNAM CA CALCIUM ION
FORMUL 2 CA CA1 2+
FORMUL 3 HOH *134(H2 O1)
HELIX 1 1 LEU 2 LYS 12 1 11
HELIX 2 2 PRO 18 ASP 21 1 4
HELIX 3 3 ASP 40 LYS 57 1 18
HELIX 4 4 ASP 59 VAL 63 1 5
HELIX 5 5 ALA 90 LYS 108 1 19
HELIX 6 6 LYS 113 HIS 115 5 3
SHEET 1 A 2 TYR 75 SER 78 0
SHEET 2 A 2 GLU 81 CYS 84 -1 N THR 83 O SER 76
SSBOND 1 CYS 11 CYS 77
SSBOND 2 CYS 27 CYS 123
SSBOND 3 CYS 29 CYS 45
SSBOND 4 CYS 44 CYS 105
SSBOND 5 CYS 51 CYS 98
SSBOND 6 CYS 61 CYS 91
SSBOND 7 CYS 84 CYS 96
LINK CA CA 124 O TYR 28
LINK CA CA 124 O GLY 32
CRYST1 47.120 64.590 38.140 90.00 90.00 90.00 P 21 21 21 4
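As a rough illustration of how a derived database could be built from records like these, the sketch below turns the SSBOND lines of 1UNE into a small table of disulfide bridges. The function name `derive_disulfides`, the dictionary layout and the `separation` field are assumptions made for this example, and the records are split on whitespace because that is how they appear on the slide, not because the PDB format is whitespace-delimited.

```python
# SSBOND records copied from the 1UNE excerpt (whitespace-collapsed, no chain ID).
RECORDS = """\
SSBOND 1 CYS 11 CYS 77
SSBOND 2 CYS 27 CYS 123
SSBOND 3 CYS 29 CYS 45
SSBOND 4 CYS 44 CYS 105
SSBOND 5 CYS 51 CYS 98
SSBOND 6 CYS 61 CYS 91
SSBOND 7 CYS 84 CYS 96
"""


def derive_disulfides(text, pdb_id):
    """Collect one row per SSBOND record (hypothetical derived-table layout)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Expect: SSBOND, serial, CYS, residue number, CYS, residue number
        if len(fields) == 6 and fields[0] == "SSBOND":
            first, second = int(fields[3]), int(fields[5])
            rows.append({
                "pdb_id": pdb_id,
                "serial": int(fields[1]),
                "res1": first,
                "res2": second,
                "separation": abs(second - first),  # distance along the sequence
            })
    return rows


disulfides = derive_disulfides(RECORDS, "1UNE")
for row in disulfides:
    print(row)
```

Each row keeps the PDB identifier, so rows derived from many entries could be pooled into one focused table without touching the parent database.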
SUB-DERIVED DATABASE
EXAMPLE-1
XXXXXSEKAR
RADHASEKAR
SHAMIASEKAR
SARADASEKAR
EXAMPLE-2
XAXAXA
KAMALA
SARADA
YAMAHA
KANAGA
MANASA
VANASA
PANAMA
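One possible reading of EXAMPLE-2 is that X stands for any single letter, so XAXAXA matches every six-letter name with A in the second, fourth and sixth positions. The sketch below implements that reading with a regular expression; the helper names `wildcard_to_regex` and `search` are assumptions for illustration. The names in EXAMPLE-1 have prefixes of different lengths, so they would need a variable-length wildcard that this sketch does not cover.

```python
import re


def wildcard_to_regex(pattern, wildcard="X"):
    """Translate a pattern such as XAXAXA into a compiled regular expression.

    Assumption: the wildcard character matches exactly one uppercase letter.
    """
    parts = ["[A-Z]" if ch == wildcard else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def search(pattern, names):
    """Return the names that match the pattern over their whole length."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]


NAMES = ["KAMALA", "SARADA", "YAMAHA", "KANAGA", "MANASA", "VANASA", "PANAMA",
         "RADHASEKAR", "SEKAR"]

hits = search("XAXAXA", NAMES)
print(hits)
```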
Adding information to the database
Software to collate the required information from the database
Analyze the collated information
WHY A TOOL?
The amount of information in the world is growing exponentially, and it is becoming impossible to manage the data effectively. Machine assistance is clearly necessary, but the difficulty lies in designing systems and software capable of discovering “useful” information with minimal human intervention.
PROTEIN DATA BANK (PDB)
GENOME DATABASE (GDB)
STRUCTURAL CLASSIFICATION OF PROTEINS (SCOP)
CAMBRIDGE STRUCTURAL DATABASE (CSD)
Given PDB-Id : 1une
HEADER HYDROLASE 05-NOV-97 1UNE
TITLE CARBOXYLIC ESTER HYDROLASE, 1.5 ANGSTROM ORTHORHOMBIC FORM
TITLE 2 OF THE BOVINE RECOMBINANT PLA2
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: PHOSPHOLIPASE A2;
COMPND 3 CHAIN: NULL;
COMPND 4 EC: 3.1.1.4;
COMPND 5 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: BOS TAURUS;
SOURCE 3 ORGANISM_COMMON: BOVINE;
SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 5 EXPRESSION_SYSTEM_STRAIN: BL21 (DE3) PLYSS;
SOURCE 6 EXPRESSION_SYSTEM_PLASMID: PTO-A2MBL21;
SOURCE 7 EXPRESSION_SYSTEM_GENE: MATURE PLA2
KEYWDS HYDROLASE, ENZYME, CARBOXYLIC ESTER HYDROLASE
EXPDTA X-RAY DIFFRACTION
AUTHOR M.SUNDARALINGAM
REVDAT 1 06-MAY-98 1UNE 0
REMARK 1 REFERENCE 1
REMARK 1 AUTH K.SEKAR,A.KUMAR,X.LIU,M.-D.TSAI,M.H.GELB,
REMARK 1 AUTH 2 M.SUNDARALINGAM
REMARK 1 TITL CRYSTAL STRUCTURE OF THE COMPLEX OF BOVINE
REMARK 1 TITL 2 PANCREATIC PHOSPHOLIPASE A2 WITH A TRANSITION STATE
REMARK 1 TITL 3 ANALOGUE
REMARK 1 REF TO BE PUBLISHED
REMARK 1 REFN 0353
REMARK 1 REFERENCE 2
REMARK 1 AUTH K.SEKAR,C.SEKARUDU,M.-D.TSAI,M.SUNDARALINGAM
REMARK 1 TITL 1.72A RESOLUTION REFINEMENT OF THE TRIGONAL FORM OF
REMARK 1 TITL 2 BOVINE PANCREATIC PHOSPHOLIPASE A2
REMARK 1 REF TO BE PUBLISHED
REMARK 1 REFN 0353
REMARK 1 REFERENCE 3
REMARK 1 AUTH K.SEKAR,S.ESWARAMOORTHY,M.K.JAIN,M.SUNDARALINGAM
REMARK 1 TITL CRYSTAL STRUCTURE OF THE COMPLEX OF BOVINE
REMARK 1 TITL 2 PANCREATIC PHOSPHOLIPASE A2 WITH THE INHIBITOR
REMARK 1 TITL 3 1-HEXADECYL-3-(TRIFLUOROETHYL)-SN-GLYCERO-2
REMARK 1 TITL 4 PHOSPHOMETHANOL
REMARK 1 REF BIOCHEMISTRY V. 36 14186 1997
REMARK 2 RESOLUTION. 1.5 ANGSTROMS.
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : X-PLOR 3.1
REMARK 3 AUTHORS : BRUNGER
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.5
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 10.0
REMARK 3 DATA CUTOFF (SIGMA(F)) : 1.0
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : 0.1
REMARK 3 DATA CUTOFF LOW (ABS(F)) : 1000000.0
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 92.
REMARK 3 NUMBER OF REFLECTIONS : 17572
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : NULL
REMARK 3 FREE R VALUE TEST SET SELECTION : X-PLOR
REMARK 3 R VALUE (WORKING SET) : 0.184
REMARK 3 FREE R VALUE : 0.228
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 7.
REMARK 3 FREE R VALUE TEST SET COUNT : 1198
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : 0.24
REMARK 3 PARAMETER FILE 1 : PARHCSDX.PRO
REMARK 3 PARAMETER FILE 2 : NULL
REMARK 3 TOPOLOGY FILE 1 : TOPHCSDX.PRO
REMARK 3 TOPOLOGY FILE 2 : NULL
REMARK 3 OTHER REFINEMENT REMARKS: NULL
REMARK 4 1UNE COMPLIES WITH FORMAT V. 2.2, 16-DEC-1996
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION
REMARK 200 DATE OF DATA COLLECTION : 26-JAN-1996
REMARK 200 TEMPERATURE (KELVIN) : 291
REMARK 200 PH : 7.2
REMARK 200 NUMBER OF CRYSTALS USED : 1
REMARK 200
REMARK 200 SYNCHROTRON (Y/N) : N
REMARK 200 RADIATION SOURCE : NULL
REMARK 200 BEAMLINE : NULL
REMARK 200 X-RAY GENERATOR MODEL : R-AXIS IIC
REMARK 200 MONOCHROMATIC OR LAUE (M/L) : M
REMARK 200 WAVELENGTH OR RANGE (A) : 1.5418
REMARK 200 MONOCHROMATOR : GRAPHITE
REMARK 200 OPTICS : NULL
REMARK 200
REMARK 200 IN THE HIGHEST RESOLUTION SHELL.
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE HIGH (A) : 1.5
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE LOW (A) : 1.55
REMARK 200 COMPLETENESS FOR SHELL (%) : 63.
REMARK 200 DATA REDUNDANCY IN SHELL : 3.7
REMARK 200 R MERGE FOR SHELL (I) : 0.172
REMARK 200 R SYM FOR SHELL (I) : NULL
REMARK 200 FOR SHELL : NULL
REMARK 200
REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: THE HIGH RESOLUTION
REMARK 200 ATOMIC COORDINATES OF THE WILD TYPE (PDB ENTRY 1BP2)
REMARK 200 WERE USED AS THE STARTING MODEL FOR REFINEMENT.
REMARK 200 SOFTWARE USED: X-PLOR
REMARK 200 STARTING MODEL: WILD TYPE (PDB ENTRY 1BP2)
REMARK 200
REMARK 200 REMARK: NULL
REMARK 280
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: P 21 21 21
REMARK 290
REMARK 290 SYMOP SYMMETRY
REMARK 290 NNNMMM OPERATOR
REMARK 290 1555 X,Y,Z
REMARK 290 2555 1/2-X,-Y,1/2+Z
REMARK 290 3555 -X,1/2+Y,1/2-Z
REMARK 290 4555 1/2+X,1/2-Y,-Z
DBREF 1UNE 1 123 SWS P00593 PA2_BOVIN 23 145
SEQADV 1UNE ASN 122 SWS P00593 LYS 144 CONFLICT
SEQRES 1 123 ALA LEU TRP GLN PHE ASN GLY MET ILE LYS CYS LYS ILE
SEQRES 2 123 PRO SER SER GLU PRO LEU LEU ASP PHE ASN ASN TYR GLY
SEQRES 3 123 CYS TYR CYS GLY LEU GLY GLY SER GLY THR PRO VAL ASP
SEQRES 4 123 ASP LEU ASP ARG CYS CYS GLN THR HIS ASP ASN CYS TYR
SEQRES 5 123 LYS GLN ALA LYS LYS LEU ASP SER CYS LYS VAL LEU VAL
SEQRES 6 123 ASP ASN PRO TYR THR ASN ASN TYR SER TYR SER CYS SER
SEQRES 7 123 ASN ASN GLU ILE THR CYS SER SER GLU ASN ASN ALA CYS
SEQRES 8 123 GLU ALA PHE ILE CYS ASN CYS ASP ARG ASN ALA ALA ILE
SEQRES 9 123 CYS PHE SER LYS VAL PRO TYR ASN LYS GLU HIS LYS ASN
SEQRES 10 123 LEU ASP LYS LYS ASN CYS
HET CA 124 1
HETNAM CA CALCIUM ION
FORMUL 2 CA CA1 2+
FORMUL 3 HOH *134(H2 O1)
HELIX 1 1 LEU 2 LYS 12 1 11
HELIX 2 2 PRO 18 ASP 21 1 4
HELIX 3 3 ASP 40 LYS 57 1 18
HELIX 4 4 ASP 59 VAL 63 1 5
HELIX 5 5 ALA 90 LYS 108 1 19
HELIX 6 6 LYS 113 HIS 115 5 3
SHEET 1 A 2 TYR 75 SER 78 0
SHEET 2 A 2 GLU 81 CYS 84 -1 N THR 83 O SER 76
SSBOND 1 CYS 11 CYS 77
…
REMARK 3 FIT IN THE HIGHEST RESOLUTION BIN.
REMARK 3 TOTAL NUMBER OF BINS USED : 8
REMARK 3 BIN RESOLUTION RANGE HIGH (A) : 1.5
REMARK 3 BIN RESOLUTION RANGE LOW (A) : 1.55
REMARK 3 BIN COMPLETENESS (WORKING+TEST) (%) : 63.
REMARK 3 REFLECTIONS IN BIN (WORKING SET) : 1176
REMARK 3 BIN R VALUE (WORKING SET) : 0.340
REMARK 3 BIN FREE R VALUE : 0.352
REMARK 3 BIN FREE R VALUE TEST SET SIZE (%) : 7.
REMARK 3 BIN FREE R VALUE TEST SET COUNT : 81
REMARK 3 ESTIMATED ERROR OF BIN FREE R VALUE : NULL
REMARK 3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
REMARK 3 PROTEIN ATOMS : 957
REMARK 3 NUCLEIC ACID ATOMS : 0
REMARK 3 HETEROGEN ATOMS : 1
REMARK 3 SOLVENT ATOMS : 134
REMARK 3 B VALUES.
REMARK 3 FROM WILSON PLOT (A**2) : NULL
REMARK 3 MEAN B VALUE (OVERALL, A**2) : NULL
REMARK 3 LOW RESOLUTION CUTOFF (A) : NULL
REMARK 3 CROSS-VALIDATED ESTIMATED COORDINATE ERROR.
ATOM      1  N   ALA     1      13.830  17.835  32.697  1.00 11.41
ATOM      2  CA  ALA     1      12.869  16.725  32.889  1.00 11.31
ATOM      3  C   ALA     1      12.106  16.547  31.592  1.00 12.00
ATOM      4  O   ALA     1      12.366  17.226  30.614  1.00 11.37
ATOM      5  CB  ALA     1      11.891  17.029  34.056  1.00 11.89
ATOM      6  N   LEU     2      11.150  15.638  31.585  1.00 13.43
ATOM      7  CA  LEU     2      10.392  15.362  30.376  1.00 14.98
ATOM      8  C   LEU     2       9.556  16.543  29.879  1.00 14.65
ATOM      9  O   LEU     2       9.465  16.764  28.657  1.00 13.62
ATOM     10  CB  LEU     2       9.522  14.116  30.561  1.00 15.03
ATOM     11  CG  LEU     2       8.919  13.539  29.291  1.00 17.13
ATOM     12  CD1 LEU     2      10.038  13.103  28.360  1.00 17.29
ATOM     13  CD2 LEU     2       8.027  12.361  29.656  1.00 17.65
ATOM     14  N   TRP     3       8.960  17.305  30.796  1.00 14.18
ATOM     15  CA  TRP     3       8.157  18.443  30.347  1.00 16.10
ATOM     16  C   TRP     3       8.998  19.448  29.543  1.00 14.26
ATOM     17  O   TRP     3       8.580  19.864  28.472  1.00 14.34
ATOM     18  CB  TRP     3       7.359  19.103  31.491  1.00 19.02
ATOM     19  CG  TRP     3       8.163  19.810  32.534  1.00 24.63
ATOM     20  CD1 TRP     3       8.699  19.262  33.683  1.00 25.51
ATOM     21  CD2 TRP     3       8.505  21.199  32.555  1.00 27.29
ATOM     22  NE1 TRP     3       9.348  20.230  34.403  1.00 27.56
ATOM     23  CE2 TRP     3       9.253  21.428  33.743  1.00 28.36
ATOM     24  CE3 TRP     3       8.258  22.278  31.686  1.00 27.60
ATOM     25  CZ2 TRP     3       9.754  22.695  34.083  1.00 28.94
ATOM     26  CZ3 TRP     3       8.761  23.542  32.026  1.00 28.78
ATOM     27  CH2 TRP     3       9.503  23.735  33.216  1.00 29.43
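To suggest how a tool might consume ATOM records, the sketch below parses the three alpha-carbon records of residues 1 to 3 from the 1UNE listing and computes the distance between consecutive alpha carbons. The function names and the dictionary keys are assumptions for illustration. Fields are split on whitespace, which works for these lines; real PDB files are fixed-column, so a production parser would slice by column position instead.

```python
import math

# CA records of residues 1-3, values taken from the 1UNE listing above.
ATOM_LINES = """\
ATOM 2 CA ALA 1 12.869 16.725 32.889 1.00 11.31
ATOM 7 CA LEU 2 10.392 15.362 30.376 1.00 14.98
ATOM 15 CA TRP 3 8.157 18.443 30.347 1.00 16.10
"""


def parse_atom(line):
    """Split one whitespace-delimited ATOM record into named fields."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "res_name": f[3],
        "res_seq": int(f[4]),
        "xyz": (float(f[5]), float(f[6]), float(f[7])),
        "occupancy": float(f[8]),
        "b_factor": float(f[9]),
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


atoms = [parse_atom(l) for l in ATOM_LINES.splitlines() if l.startswith("ATOM")]
ca_ca = distance(atoms[0], atoms[1])
print(f"CA(1)-CA(2) distance: {ca_ca:.2f} A")
```

A distance close to 3.8 angstroms between neighbouring alpha carbons is a common sanity check that the coordinates were read in the right order.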
CAMBRIDGE STRUCTURAL DATABASE
• The CAMBRIDGE STRUCTURAL DATABASE
• Software for search, retrieval, display and analysis of CSD contents
The CSD records bibliographic, 2D chemical and 3D structural results from crystallographic analysis of organics, organometallics and metal complexes. Both X-ray and neutron diffraction studies are included for small and medium-sized compounds containing up to 500 atoms (including hydrogens).
THREE DBA COMPONENTS
Database Integrity
Database Security
Database Recovery
DATABASE INTEGRITY
The major issue for database management is to ensure that the data in the database is accurate, correct, valid and consistent. Any inconsistency between two or more entries that represent the same entity demonstrates a lack of integrity.
Database technology cannot do very much to protect users against data errors made in the outside world before the data has been entered in the system.
However, certain safety measures can be built into a database to ensure that errors within the system are minimized.
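One kind of built-in safety measure is a declarative constraint that the database itself enforces. The sketch below uses Python's standard sqlite3 module with an in-memory database; the table name `entry`, its columns and the helper `add_entry` are assumptions for illustration, and the negative resolution is a deliberately invalid value, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entry (
        pdb_id     TEXT PRIMARY KEY CHECK (length(pdb_id) = 4),
        resolution REAL NOT NULL CHECK (resolution > 0),
        method     TEXT NOT NULL
    )
""")


def add_entry(pdb_id, resolution, method):
    """Insert one row; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO entry VALUES (?, ?, ?)",
                         (pdb_id, resolution, method))
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    add_entry("1UNE", 1.5, "X-RAY DIFFRACTION"),   # accepted
    add_entry("1UNE", 1.7, "X-RAY DIFFRACTION"),   # rejected: duplicate identifier
    add_entry("1BP2", -1.0, "X-RAY DIFFRACTION"),  # rejected: impossible resolution
]
row_count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(results, row_count)
```

The primary key prevents two entries from representing the same entity, and the CHECK clauses reject values that cannot be correct, which are the two integrity problems named above.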
DATA RECOVERY
The process of recovery involves restoring the database to a state which is known to be correct following some kind of failure.
The technique of redundancy is used in the sense that it has to be possible to recover the database to its correct state from information available somewhere else in the system.
The most common way to achieve this is to dump the contents of the database at a defined frequency onto another medium, such as magnetic tape or optical disk, which is then stored in a safe place.
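A dump-and-restore cycle can be sketched with the standard sqlite3 module, whose `iterdump()` method yields the SQL statements needed to rebuild a database. The file name `backup.sql`, the temporary directory standing in for "another medium", and the one-row table are assumptions for illustration; scheduling the dump at a fixed frequency is left out.

```python
import os
import sqlite3
import tempfile


def dump_database(conn, path):
    """Write the whole database as SQL text to a separate file."""
    with open(path, "w", encoding="utf-8") as handle:
        for statement in conn.iterdump():
            handle.write(statement + "\n")


def restore_database(path):
    """Rebuild a fresh in-memory database from a dump file."""
    restored = sqlite3.connect(":memory:")
    with open(path, encoding="utf-8") as handle:
        restored.executescript(handle.read())
    return restored


# A tiny live database standing in for the real one.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE entry (pdb_id TEXT PRIMARY KEY, title TEXT)")
live.execute("INSERT INTO entry VALUES ('1UNE', 'PHOSPHOLIPASE A2')")
live.commit()

backup_path = os.path.join(tempfile.mkdtemp(), "backup.sql")
dump_database(live, backup_path)

recovered = restore_database(backup_path)
rows = recovered.execute("SELECT pdb_id, title FROM entry").fetchall()
print(rows)
```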
DATABASE SECURITY
The DBA has to ensure that adequate measures are taken to prevent unauthorized disclosure, alteration or destruction of both the data within the database and the database software itself.
A password and a list of privileges attached to it are most commonly used to control user access rights to database information.
THREE COMPONENTS OF DATABASE
Development of a database structure that allows the storage and maintenance of the required data
Data entry, management and maintenance
Retrieval of the data by end users equipped with suitable analysis and display tools
DATABASE ADMINISTRATION
The database administrator (DBA) is a person or a group of persons responsible for overall control of database systems.
The DBA is usually answerable not only for the design of the database, but also for the choice of DBMS used, its implementation, and the training of all those involved in running and using the database.
Once the data is entered, it has to be maintained and kept up to date.
PROBLEMS WITH THE DATA
Incomplete data
Noisy data
Temporal data
An extremely large amount of data
Non-textual data
INCOMPLETE DATA
Some data may be missing (e.g., some fields may be left blank).
Sometimes, the fact that data is missing is itself a valuable piece of information.
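A minimal sketch of treating missingness as information: count the blank fields per column, then record an explicit flag for each blank so that later analysis can use it. The records, field names and helper names below are hypothetical and chosen only for illustration.

```python
# Hypothetical records; None or an empty string marks a field left blank.
RECORDS = [
    {"entry": "A", "resolution": 1.5, "ph": 7.2},
    {"entry": "B", "resolution": None, "ph": 6.5},
    {"entry": "C", "resolution": 2.0, "ph": ""},
    {"entry": "D", "resolution": None, "ph": None},
]


def is_missing(value):
    return value is None or value == ""


def missing_profile(records):
    """Count how many records leave each field blank."""
    counts = {}
    for record in records:
        for field, value in record.items():
            counts.setdefault(field, 0)
            if is_missing(value):
                counts[field] += 1
    return counts


def add_missing_flags(records, fields):
    """Return copies of the records with a '<field>_missing' flag per field."""
    flagged = []
    for record in records:
        copy = dict(record)
        for field in fields:
            copy[field + "_missing"] = is_missing(record.get(field))
        flagged.append(copy)
    return flagged


profile = missing_profile(RECORDS)
flagged = add_missing_flags(RECORDS, ["resolution", "ph"])
print(profile)
```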
NOISY DATA
A field may contain incorrectly entered information.
We do not know how this affects the certainty factor (or confidence level) of the results.
TEMPORAL DATA
Since databases grow rapidly, how can data be incrementally added to our results?
What effect should this have on the knowledge discovery process?
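One common answer, offered here as an illustration and not as the method used in the talk, is to keep summary statistics that can be updated one value at a time (Welford's online algorithm), so a new batch is folded into the existing result without reprocessing the old data. The class name `RunningStats` is an assumption; the sample values are the B-values of residues 1 and 2 from the 1UNE ATOM listing.

```python
class RunningStats:
    """Mean and variance that can be updated incrementally (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of all values seen so far."""
        return self._m2 / self.count if self.count else 0.0


first_batch = [11.41, 11.31, 12.00, 11.37, 11.89]                     # ALA 1
second_batch = [13.43, 14.98, 14.65, 13.62, 15.03, 17.13, 17.29, 17.65]  # LEU 2

stats = RunningStats()
for b_value in first_batch:
    stats.add(b_value)
mean_after_first = stats.mean

# Later, new data arrive: only the new values are processed.
for b_value in second_batch:
    stats.add(b_value)

print(round(mean_after_first, 3), round(stats.mean, 3))
```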
AN EXTREMELY LARGE AMOUNT OF DATA
Some datasets can grow significantly over time.
How should such datasets be processed?
One option is to perform parallel processing, where n processors each process approximately 1/n-th of the data in approximately 1/n-th of the time.
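The idea of giving each of n workers roughly 1/n-th of the data can be sketched as follows. This example uses a thread pool from the standard library as a stand-in; for CPU-bound work a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the usual substitute, and the 1/n speed-up is an idealisation that ignores the cost of splitting and combining. The function names and the synthetic residue list are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_cysteines(residues):
    """The per-chunk task: count CYS residues."""
    return sum(1 for residue in residues if residue == "CYS")


def parallel_count(residues, n):
    """Run the task on each chunk in its own worker and combine the results."""
    chunks = split_into_chunks(residues, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        return sum(pool.map(count_cysteines, chunks))


# Synthetic data: a short residue pattern repeated many times.
residues = ["CYS", "TYR", "CYS", "GLY", "LEU"] * 1000
total = parallel_count(residues, 4)
print(total)
```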
NON-TEXTUAL DATA
There are many types of data that need to be manipulated, including image data, multimedia data (video and sound), spatial data in GIS and user-defined data types.
Data -> Selection -> Target data
Target data -> Preprocessing & transformation -> “Cleaned” data
“Cleaned” data -> Data Mining -> Patterns
Patterns -> Interpretation, evaluation & validation -> Knowledge
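The stages of this process can be sketched as a chain of small functions, one per arrow. Everything below is a toy illustration: the raw list, the cleaning rules, the frequency count standing in for "data mining", and the `min_count` threshold are all assumptions.

```python
from collections import Counter

# Toy raw data with blanks, stray spaces and inconsistent case.
RAW = [" kamala", "SARADA ", "yamaha", "", "Sarada", None, "PANAMA", "sarada"]


def select(data):
    """Selection: keep only usable items (data -> target data)."""
    return [item for item in data if item is not None]


def preprocess(target):
    """Preprocessing and transformation: trim, normalise case, drop blanks."""
    return [item.strip().upper() for item in target if item.strip()]


def mine(cleaned):
    """Data mining: here, simply count how often each value occurs."""
    return Counter(cleaned)


def evaluate(patterns, min_count=2):
    """Evaluation: keep only patterns seen at least min_count times."""
    return {value: n for value, n in patterns.items() if n >= min_count}


knowledge = evaluate(mine(preprocess(select(RAW))))
print(knowledge)
```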
Standalone machine application
Web application
PERL
Very powerful for string manipulation
Uses CGI as the interface
JAVA
Application programming (standalone machine)
Applet programming (web oriented)
Useful for graphics applications over the WWW
WHAT IS PERL?
PERL is an interpreted language optimized for scanning arbitrary text files and extracting information from those text files.
The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant and minimal).
PERL uses sophisticated pattern matching techniques to scan large amounts of data very quickly. Although optimized for scanning text, PERL can also deal with binary data and can make dbm files look like associative arrays.
CGI (Common Gateway Interface)
The Common Gateway Interface (CGI), as its name implies, provides a gateway between a user (client) and a command/logic-oriented server.
CGI performs the task of translation: it translates the needs of clients into server requests and then translates the server replies back for the clients.
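The two-way translation can be sketched as a single function that takes the query string sent by a client, turns it into a lookup on the server side, and turns the result back into an HTML reply. The talk names Perl for CGI work; Python is used here only for illustration. The parameter name `id`, the function `handle_request` and the one-entry lookup table are assumptions, and a real CGI program would read the query string from the environment and print the reply.

```python
from html import escape
from urllib.parse import parse_qs


def handle_request(query_string, lookup):
    """Translate a client query into a lookup, then the result into HTML."""
    params = parse_qs(query_string)
    pdb_id = params.get("id", [""])[0].upper()
    title = lookup.get(pdb_id)
    if title is None:
        body = "<p>No entry found for %s</p>" % escape(pdb_id)
    else:
        body = "<h1>%s</h1><p>%s</p>" % (escape(pdb_id), escape(title))
    # A CGI reply starts with a header block, then a blank line, then the page.
    return "Content-Type: text/html\r\n\r\n<html><body>%s</body></html>" % body


ENTRIES = {"1UNE": "PHOSPHOLIPASE A2"}  # stand-in for the server-side data

response = handle_request("id=1une", ENTRIES)
print(response)
```

Escaping the echoed identifier keeps markup typed by the client from being interpreted as part of the reply page.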
Client Client
Java Servlet
CGI
Server Server
The RMI concept is very useful for multitier architecture
EXAMPLE
www.hotmail.com
www.google.com
Software (Search Engine)
RMI
WEB-Page
Java Server Pages (Sun Microsystems)
Active Server Pages (Microsoft Corporation)
Useful for dynamic web page creation
GRAPHICAL USER INTERFACE (GUI)
The programmer can quickly design the user interface by drawing and arranging the screen elements rather than writing the raw code
A GUI is easily visualizable to users
It is user friendly
Example: MS-WINDOWS OPERATING SYSTEMS
GUI (Graphical User Interface)
ActiveX (Microsoft Corporation)
Java Swing (Sun Microsystems)
Buttons, boxes and pull-down menus (Windows based)
VB (Visual Basic)
Application development language
Supports graphics
Good for standalone applications
Web programming is not possible, but script languages (VBScript or JavaScript) can be used to make it web oriented
VC++
System & application programming
Almost the same as VB
Additional advantage: system side
WORLD WIDE WEB (WWW)
The World Wide Web is the best-known and fastest-growing Internet function. It is a way of accessing information already on the Internet, using the concept of hypertext to link information. As with FTP, any type of digital document, image, artwork, movie or sound on a remote computer can be made available through hyperlinks. The protocol used for accessing such information is HTTP (Hyper Text Transfer Protocol).
The hyperlinked documents are known as HTML documents. They are written in a special language called HTML, which stands for Hyper Text Markup Language. HTML is nothing but ASCII text with embedded tags.
DBMS & RDBMS
DBMS: Dbase, MS-Access, Mysql-server, FoxPro
(partially RDBMS)
RDBMS: Sybase, Oracle, SQL-server
DATABASE
A bunch of tables
TABLES
Store numerous rows of information
FIELDS
The little boxes inside a table
SQL Server is an expensive whopper of a database system, used in corporations that need to store huge wads of information
ORACLE is another database format
The best way to create your own Access database is by using Microsoft Access. This tool ships with the professional edition of Office 97 and enables you to graphically design your own tables and individual fields.
Yet another one is MySQL
Typical Web Search
Keywords
Search Engine
Output
Web Browser
HTML Form
O/p (in HTML)
WWW
HTML Form
O/p (in HTML)
CGI-Program
Flat file
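The last two boxes of this flow, a program that searches a flat file and returns its output as HTML, can be sketched as below. The tab-separated layout of the flat file, the function names and the use of an in-memory `io.StringIO` object in place of a file on disk are assumptions for illustration; the tool names and descriptions are those listed later in this talk.

```python
import io
from html import escape

# Stand-in for a flat file on the server: one "name<TAB>description" per line.
FLAT_FILE = io.StringIO(
    "MSGS\tMotif Search in Genome Sequences\n"
    "PSST\tProtein Sequence Search Tool\n"
    "CAP\tConformation Angles Package\n"
    "RP\tRamachandran Plot on the web\n"
)


def search_flat_file(handle, keyword):
    """Return (name, description) pairs whose description contains the keyword."""
    keyword = keyword.lower()
    hits = []
    for line in handle:
        name, _, description = line.rstrip("\n").partition("\t")
        if keyword in description.lower():
            hits.append((name, description))
    return hits


def render_html(hits):
    """Format the hits as an HTML list for the browser."""
    items = "".join("<li>%s: %s</li>" % (escape(n), escape(d)) for n, d in hits)
    return "<ul>%s</ul>" % items


hits = search_flat_file(FLAT_FILE, "search")
page = render_html(hits)
print(page)
```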
Mirror sites
PDB
GDB
SCOP
PROTEIN DATA BANK (PDB)
144.16.71.2
144.16.49.185
203.90.127.146 (VPN users)
PDB-MIRROR MACHINE
3.40 GHz PIV machine
2 GB RD RAM
1 Tera-byte Hard Disk
32 MB Graphics Card
Powered by Intel SOLARIS
PDB
The PDB server is up to date and as of now contains 24,080 coordinate entries (21,788 proteins, 992 protein and nucleic acid complexes, 1,282 nucleic acids).
GENOME DATABASE (GDB)
144.16.71.10
144.16.49.185
203.90.127.147 (VPN users)
GDB-MIRROR site machine
3.40 GHz PIV machine
2 GB RD RAM
1 Tera-byte Hard Disk
32 MB Graphics Card
Powered by Intel SOLARIS
Structural Classification of Proteins (SCOP)
144.16.71.2/scop
144.16.49.78/scop
203.90.127.146/scop (for VPN users)
SCOP
The SCOP mirror site at the institute has been created and is maintained with the latest copy. The mirror site (version 1.63, May 2003 release) now contains 49,497 domains from 18,946 PDB entries.
Packages developed at the
Bioinformatics Centre
Raman Building
Indian Institute of Science
Bangalore 560 012
Dr. K. SEKAR
E-mail sekar@physics.iisc.ernet.in
GENOME SEQUENCES
MSGS
Motif Search in Genome Sequences
-A web based interactive display tool
P. Selvarani, B.N. Vijay, V. Shanthi, S. Saravanan
and K. Sekar
(To be submitted)
http://144.16.71.10/msgs (Internet users)
http://203.90.127.147/msgs (VPN users)
THGS
A Web based database of
Transmembrane Helices in Genome
Sequences
S.A. Fernando, P. Selvarani, Soma Das, Ch. Kiran kumar,
S. Mondal, S. Ramakumar and K. Sekar
NUCL. ACIDS RES. (2004), 32, D125-D128
http://144.16.71.10/thgs (Internet users)
http://203.90.127.147/thgs (VPN users)
PROTEIN SEQUENCES
PSST
Protein Sequence Search Tool
-A web based interactive
search engine
S. Saravanan, A. Ajmal Khan and K. Sekar
CURR. SCI. (2000), 550-552
http://144.16.71.10/psst (Internet users)
http://203.90.127.147/psst (VPN users)
PROTEIN STRUCTURES
BSDD
Biomolecules Segment Display Device
-A web based interactive display tool
P. Selvarani, V. Shanthi, C.K. Rajesh, S. Saravanan
and K. Sekar
J. MOL. GRA. & MODEL. (2004) (In the press)
http://144.16.71.2/bsdd (Internet users)
http://203.90.127.146/bsdd (VPN users)
PDB Goodies
-a web-based GUI to manipulate
the Protein Data Bank file
A.S.Z. Hussain, V. Shanthi, S.S. Sheik,
J. Jeyakanthan, P. Selvarani and K. Sekar
ACTA. CRYST. (2002), D58, 1385-1386
http://144.16.71.11/pdbgoodies (Internet users)
http://203.90.127.149/pdbgoodies (VPN users)
CAP
Conformation Angles Package
-Displaying the conformation angles
of side chains in proteins
S.S. Sheik, P. Sundararajan, V. Shanthi and K. Sekar
BIOINFORMATICS (2003), 19, 1043-1044
http://144.16.71.146/cap (Internet users)
http://203.90.127.148/cap (VPN users)
WAP
- a Web-based package to calculate
geometrical parameters between
water oxygen and protein atoms
V. Shanthi, C.K. Rajesh, J. Jayalakshmi, V.G. Vijay
and K. Sekar
J. APPL. CRYST. (2003), 36, 167-168
http://144.16.71.11/wap (Internet users)
http://203.90.127.149/wap (VPN users)
RP
Ramachandran Plot on the web
S.S. Sheik, P. Sundararajan, A.S.Z. Hussain
and K. Sekar
BIOINFORMATICS (2002), 18, 1548-1549
http://144.16.71.146/rp (Internet users)
http://203.90.127.148/rp (VPN users)
SSEP
Secondary Structural Elements
of Proteins
V. Shanthi, P. Selvarani, Ch. Kiran Kumar,
C.S.Mohire and K. Sekar
NUCL. ACIDS RES. (2003), 31, 3404-3405
http://144.16.71.148/ssep (Internet users)
http://203.90.127.150/ssep (VPN users)
SEM
Symmetry Equivalent Molecules
A.S.Z. Hussain, Ch. Kiran Kumar, C.K. Rajesh,
S.S. Sheik and K. Sekar
NUCL ACIDS RES. (2003), 31, 3356-3358.
http://144.16.71.11/sem (Internet users)
http://203.90.127.149/sem (VPN users)
CADB
Conformational Angles DataBase of proteins
S.S. Sheik, P. Ananthalakshmi, G. Ramya Bhargavi
and K. Sekar
NUCL. ACIDS RES. (2003), 31(1), 448-451
http://144.16.71.148/cadb (Internet users)
http://203.90.127.150/cadb (VPN users)
Non-homologous (25% Identity) protein chains
Hobohm & Sander, Protein Sci. 3, 522-524
X-Ray Diffraction   : 1,276 (25)
NMR                 : 460 (2)
Fibre Diffraction   : 3 (0)
Others              : 0 (5)
Total no. of chains : 1,739 (32)
Total no. of residues in
X-Ray Diffraction   : 2,53,623
NMR                 : 37,281
Numbers within the parentheses denote files having Cα coordinates.
Non-homologous (90% Identity) protein chains
Hobohm & Sander, Protein Sci. 3, 522-524
X-Ray Diffraction   : 5,147 (26)
NMR                 : 993 (5)
Fibre Diffraction   : 6 (0)
Others              : 0 (5)
Total no. of chains : 6,146 (36)
Total no. of residues in
X-Ray Diffraction   : 11,29,466
NMR                 : 72,145
Numbers within the parentheses denote files having Cα coordinates.
LySDB
Lysozyme Structural DataBase
K. S. Mohan, Soma Das, C. Chockalingham,
V. Shanthi & K. Sekar
ACTA CRYST. (2004), D60, 597-600.
http://144.16.71.2/lysdb (Internet users)
http://203.90.127.146/lysdb (VPN users)
TAKE HOME MESSAGE
Data mining is nothing but exploiting the hidden trends in your data
Create your own derived database
No one tool or set of tools is universally applicable
Present the data in a useful format such as a graph or table
Department of Biotechnology
Ministry of Science & Technology
Govt. of India
India
&
Jai Vigyan National Science Foundation
Govt. of India
India