Data-mining

Macromolecular Structure Database Structural Database Infrastructure Services for Europe www.ebi.ac.uk/msd EMBL-EBI The MSD databases The MSD actually consists of two separate databases:  the archive database is highly normalized, with thousands of relationships linking some 400 tables; the deposition database is the definitive archive for all structural data at MSD  the search database is a much simpler, denormalized database, with data items duplicated and aggregated into 40 much wider tables, making it more amenable to searching and retrieval of data : the MSDSD EMBL-EBI What is the MSDSD database  A relational database primarily developed in Oracle that stores the data derived from the PDB together with reference and other derived information  Simple to understand for the novice biologist and fast in performance for the database non-expert  Originates from the internal MSD archive database that ensures accuracy and data integrity  In MSDSD naming and other summary information is repeated from every level of the hierarchy to the next one in order to be closer to the familiar PDB data EMBL-EBI Main MSDSD features  The symmetry has been expanded and the information of the quaternary biological assemblies is directly available  External Information like binding sites and secondary structure has been derived on the assembly level  The original PDB asymmetric unit is also available  Includes and provides clear database relations with the ligands “data mart” and other reference information  Includes information and cross-references to external databases (NCBI taxonomy, UniProt, SCOP etc) EMBL-EBI DATA Analysis Data mining is a term that is applied very loosely within bioinformatics to describe any type of data analysis. Almost without exception the analysis of molecular biology is hypothesis based where the search for information has a target that is defined by the knowledge of the biological context of the data. EMBL-EBI Data Mining “Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.” “True data mining software does not just change the presentation, but discovers previously unknown relationships among the data. (Webopedia and other technical dictionaries)” was first “invented” by IBM EMBL-EBI Traditional analysis is via “verification-driven analysis” Requires hypothesis of the desired information (target) Requires correct interpretation of proposed query Discovery-driven data mining Finds data with common characteristics Results are ideal solutions to discovery Finds results without previous hypothesis EMBL-EBI So what is Hypothesis driven data analysis ?  Define a target = hypothesis  Search for target  There are/are-not “hits”  Verify/negate hypothesis  Distribution is centred on target “catalytic triad” : Atomic coordinates : Mathematical graph : HIS,ASP,SER : text string matching coordinate superposition graph matching data hierarchy knowledge EMBL-EBI For example, it is possible to find the presence of catalytic triads within the PDB by selecting an example structure and then using a matching technique such as coordinate superposition or graph analysis to screen this against all the coordinate data within the PDB. This will identify the presence of similar residue configurations to the search target and result in a distribution of hits centered on the original search model. HOWEVER we can only find similar objects distributed about this target. EMBL-EBI Discovery-driven data mining  Finds data with common characteristics  Results are ideal solutions to discovery  Finds results without previous hypothesis  Target is mathematical – so has no scientific dependency EMBL-EBI Mining techniques  Creation of predictive models : future data expectation  Link analysis : connections between data objects  Database segmentation : classification  Deviation detection : finding outliers. EMBL-EBI So what is this data mining ?  Given multiple sets of primary data)  Characters, numbers, Function(numbers),….  Find anomalies  To many : numerical occurrence  Data variation : Derivatives  Singularities  Correlations and clusters Finds new things ! But not what it means !  Within primary data  with other data (dependent variables) EMBL-EBI Discovery driven data mining of the PDB  Analysis of 3-dimensional coordinates  Defined common patterns of atomic interactions locally  DB segmentation - active sites & common packing features  Link analysis - Similarity between different functional group  Defined globally  DB segmentation - common patterns of super-secondary str’  Link analysis - common folds in diverse protein families  Outlier detection - unique folds  Nucleic Acid sequences  Define information content using information technology EMBL-EBI Issues  Systematic “error” propagates as solution  300 lysozyme structures return as a strong solution  Results cannot be found below the noise level  Need to characterise the noise level  Need to improve signal/noise ratio (S/N) to see information  Target is not biologically defined  It does not give you the biological answer  Results should reproduce known biology  Can give you new results not previously observed EMBL-EBI Data selection  Cannot leave in 300 lysozyme structures !  Select by sequence similarity at 70% exact alignment  Different “phase space” to select data  Remove structures with resolution < 2.5A  Remove NMR (different statistics)  Remove pre-1982 etc.  Geometrical analysis criteria to check for outliers EMBL-EBI Off the shelf products  Main problem – they “all” do column correlation – but this requires row analysis  Ie you can find whether x coordinates are more correlated to y coordinates than z coordinates  Slow  I tried the above on 1e3 of data and it took hours; not much chance on 1.6e9 data then.  Money often EMBL-EBI Local atomic interactions  Data  3D coordinates  Atom types  Residue types  Convert coordinates to distances - easier to compare, no need to superpose coordinates.  Create 3D Hash table of triplets of distances between “points” EMBL-EBI Local atomic interactions  Merge triplets  Any pair of N-fold interactions are a (N+1) interaction if they have (N-1) equivalence.  Just keep going until no more (N+1) interaction are found.  Time = 8 seconds  (Digital alpha ES40) EMBL-EBI Local atom interactions  Define key atoms/groupsof-atoms as run time parameters.  Solves problem of residue symmetry  Approximation for speed  This is a hypothesis  External definition of residue equivalence (PHE  TYR) for released data.  Improves Signal/Noise ratio.  This is a hypothesis EMBL-EBI Is this data mining ?  Basic 3D correlation of distances is  Program can be run without any prior definitions.  Addition of key atoms and residue equivalence introduces biology and chemistry  introduces hypothesis regarding what is important.  Without adding this information you get very little out.  Improvement to the method should spot this without being told ! EMBL-EBI Re-implemented  This idea has been re-implemented  Core analysis on distance only  Statistical analysis of residue equivalence is carried out – will find residue equivalence  Bit slower now – 2 minutes  To use MSD assembly data  Must be able to normalise by chain similarity to remove common features due to structure.  Can use MSD similarity tables for this. EMBL-EBI Refining the answers  This analysis produces approximate geometrical results  For each “solution”, a second full All vs. All LSQ overlay is performed  handles symmetry in D,E,R,P,T  handles different residue overlays  Clusters results using average linkage  Writes average + superposed coordinates + ligands. EMBL-EBI Catalytic quartet EMBL-EBI Electrostatic interaction Ligands are found close by rather than associated with the residues EMBL-EBI N-linked glycosolation binding site +?  Spot the non-sugar  This glycosolation site is the same as active site found in “1a53” – indol-3glycerolphosphate synthase EMBL-EBI Summary  Creates 1000’s of results  Returns many metal and catalytic sites  50% have at least 2 of 3 residues as sequence neighbours  30% have associated ligands http://www.ebi.ac.uk/msd-srv/MSDtemplate/ See T.J.Oldfield (2003) PROTEINS: Structure, Function, and Genetics 49, 510-528. T.J.Oldfield (2002) Acta Cryst. D57, 1421-1427 EMBL-EBI Data mining – not idiot proof  Date of birth and age will give 100 % correlation  Authors for structure submission will be correlated to authors on primary citation.  “Lysozyme” is the most common fold pattern  36 spelling’s of E.Coli will mask results.  Requires representative sets  Statistically valid ones too !  Signal/Noise ratio is a problem : hit the noise and the calculation grows rapidly EMBL-EBI Other methods  Representative sets and clustering  Another talk  Data mining fold  Information technological analysis of genomes

Data-mining

Related documents

Products

Support

Data-mining

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib