KNIME Visual Programming for Metabolomics
Stephan Beisken

Visual Programming
• “Visual programming languages enable physicians and other computer users with little knowledge of programming to develop computer software. The physician uses a visual paradigm to "draw" the computer interface and then attaches short segments of computer code to buttons, menus, and list boxes.”
  Ebell, M. H. (1993). Visual programming languages. M.D. Computing: Computers in Medical Practice, 10(5), 305–11.

Motivation
• Simplify your (working) life
• Data processing and analysis require various tools to work together in sequence
• Data input and output
  • Spreadsheets
• Data transformation
  • Transposition, aggregation, string manipulation
• IsaCreator
  • Formatting of tables

Agenda
• Introduction
• Tutorial
  • Installation and Extensions
  • Overview of the Workbench
  • Nodes and Table Models
• Exercises
  • Introductory Examples
  • MassCascade
  • OpenMS
  • XCMS
• Slides, software, workflows, and data for takeaway

Disclaimer
• Workflows are great
• It does not have to be KNIME; there are many other solutions
• Every method that captures information in a consistent manner and enables reproducibility is great
  • Transparency
  • Ability to share data and ‘everything’ that was done to the data

Who is already a KNIME user?
Introduction
• KNIME: Konstanz Information Miner
  • http://www.knime.org/
• Developed at the University of Konstanz in Germany
• Desktop version available free of charge (open source)
• Modular platform for building and executing workflows using predefined components: nodes
• Core functionality available for tasks such as data mining, analysis, and manipulation
• Extra features and functionality available in KNIME through extensions from various groups (community) and vendors
• Written in Java, based on the Eclipse SDK platform

Workflow Concepts
• Workflow execution
  • Can execute complex, multi-step operations on input data
  • Can be run by “non-experts” using predefined parameter templates that ensure optimal results
  • Can be set up for specific measurement systems
  • Can be shared across researchers

Functionality
• Data manipulation and analysis
  • File & database I/O, sorting, filtering, grouping, joining, pivoting
• Data mining and machine learning
  • R, WEKA, KNIME, interactive plotting
• Cheminformatics
  • Conversions, similarity, clustering, (Q)SAR analysis, etc.
• Scripting integration
  • R, Perl, Python, Matlab, Octave, Groovy
• Reporting and much more
  • Bioinformatics, HTS & image analysis, network & text mining
  • Marketing, big data and business analytics

Modules (Community Extensions)
• http://tech.knime.org/community
• Chemoinformatics
  • CDK (EMBL-EBI), RDKit (Novartis), Indigo (GGA), ErlWood (Eli Lilly), Enalos (NovaMechanics)
  • ChEMBL and ChEBI (EMBL-EBI)
• Bioinformatics
  • OpenMS (Tübingen, ETH Zurich)
  • MassCascade (EMBL-EBI)
  • HCS (MPI), NGS (Konstanz), Image analysis
• Integration
  • Python, Perl, R, Groovy, Matlab (MPI), PDB web services client (Vernalis), REST and SOAP web service support

Workflow Platforms

Applications
• Figure slides; surviving labels: Calibration, Regression
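The workflow concept from the slides above (predefined nodes with preset parameters, executed in sequence on an input table) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not KNIME's API: `Node`, `row_filter`, `sorter`, and `run_workflow` are hypothetical names, and the peak values are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A table is modelled as a list of rows, each row mapping column name -> value.
Table = List[Dict[str, float]]


@dataclass
class Node:
    """Hypothetical stand-in for a workflow node: a named step with preset parameters."""
    name: str
    func: Callable[..., Table]
    params: Dict[str, object] = field(default_factory=dict)

    def execute(self, table: Table) -> Table:
        # Apply the node's task to the incoming table using its parameter template.
        return self.func(table, **self.params)


def row_filter(table: Table, column: str, minimum: float) -> Table:
    """Keep rows whose value in `column` is at least `minimum`."""
    return [row for row in table if row[column] >= minimum]


def sorter(table: Table, column: str) -> Table:
    """Sort rows by `column` in ascending order."""
    return sorted(table, key=lambda row: row[column])


def run_workflow(nodes: List[Node], table: Table) -> Table:
    """Execute the nodes in sequence, feeding each output into the next node."""
    for node in nodes:
        table = node.execute(table)
    return table


# Illustrative peak table (made-up values).
peaks = [
    {"mz": 301.14, "intensity": 1200.0},
    {"mz": 150.05, "intensity": 80.0},
    {"mz": 205.09, "intensity": 950.0},
]

# The "parameter template": settings fixed in advance, so a user only runs the workflow.
workflow = [
    Node("Row Filter", row_filter, {"column": "intensity", "minimum": 100.0}),
    Node("Sorter", sorter, {"column": "mz"}),
]

result = run_workflow(workflow, peaks)
print([row["mz"] for row in result])
```

Because the parameters live in the workflow definition rather than in the head of whoever runs it, the same list of nodes can be shared and re-run on another table, which is the reproducibility argument the slides make.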
Advantages
• Intuitive to use
• No or little programming experience required
• Good for prototyping
• Lots of functionality
• Very modular and flexible
• Active community
• Extensible
• Visual feedback

Disadvantages
• Steep learning curve
• Resource greedy
• No (free) server edition
• Slower execution than standalone scripts

Installation
• Download and unzip KNIME
  • No further setup required
  • ./knime.ini contains arguments for launch
• Install new modules (nodes) from update sites
  • Explorer and installation wizard provided
• Workflows and data are stored in a workspace
  • ~/<user>/knime/workspace
  • C:\Users\<user>\knime\workspace
• Preferences in: File > Preferences

KNIME Workbench
• Screenshot labels: Auto-layout, Execute, Execute all nodes, Node description, tabs, workflow projects, favorite nodes, public server, workflow editor, node repository, outline, console

Nodes
• Node: basic processing unit of a workflow
  • Performs a particular task
• Title
• Icon
• Input port(s): on the left of the icon
• Output port(s): on the right of the icon
• Sequence number
• Status display (‘traffic lights’)
  • Red (not ready)
  • Amber (ready)
  • Green (executed)
  • Blue bar during execution (with percentage or flashing)
• Right-click menu: to configure and execute the node, display the output views, edit the node, and display data for the ports

Dialogs
• Double-click opens configuration dialogs
• Explicit column types

Tables
• Table rows
• Column specifications
• Various renderers
• Column types

Exercises: Preliminaries
• Pre-installed KNIME Desktop 2.9.1
• Workflows
  • starters, xcms, openms, masscascade
• Data
  • FAAH knockout LC/MS data
  • ESB tomato LC/MS QC data
  • ChEBI SDFile, KEGG SDFile
• Plug-Ins (more in About KNIME > Installation Details)
  • R (interactive)
  • Erl Wood, CDK
  • OpenMS, MassCascade

Exercises: Installation
• Open your KNIME directory
  • ~/Desktop/knime_2.9.1
  • ./knime.exe
• Memory allocation
  • ./knime.ini

Exercises: Starters
• More examples available from the Examples repository

Exercises: MassCascade
• https://bitbucket.org/sbeisken/masscascadeknime/wiki/ExampleWorkflows

Exercises: XCMS
• http://www.bioconductor.org/packages/devel/data/experiment/manuals/faahKO/man/faahKO.pdf

Exercises: OpenMS
• http://ftp.mi.fu-berlin.de/OpenMS/release-documentation/OpenMS_tutorial.pdf

Final Remarks
• Workflows can make exploratory or repetitive data tasks easier and save time
  • Extensive data pre-processing functionality
  • Extensions for statistics, machine learning, bio-, and cheminformatics
• Integration of R (XCMS) and spectrometry extensions can help you to build elaborate pipelines and share work
• Can help to organize one’s thoughts
• It’s actually quite a bit of fun

Resources
• KNIME Forum
  • http://www.knime.org/
• KNIME Learning Hub
  • http://www.knime.org/learning-hub
• Quickstart Guide
  • http://tech.knime.org/files/KNIME_quickstart.pdf
• Happy to Help
  • beisken@ebi.ac.uk

Q&A