Data mining for genetics
Hannu Toivonen

Premises of the research
• Computational methods for medical molecular genetics
• Immediate and important applications:
  – locating disease predisposing genes is essential for understanding the etiology of complex common diseases, such as heart disease or asthma
• Focus on selected topics where
  – we can have a significant impact
  – we can combine our algorithmic and data analytical expertise with the unique research on medical genetics in Finland

Computational methods
• Pattern discovery
  – How to find frequently occurring phenomena
• Markov Chain Monte Carlo (MCMC)
  – Finding posterior distributions for many-dimensional distributions

Present state of the group
• Leading researchers: Profs. Hannu Toivonen, Heikki Mannila, Jaakko Hollmén
• 5 post-docs, 7 PhD students
• Collaboration with leading groups in medical genetics
  – Prof. Leena Palotie (Public Health Institute)
  – Prof. Juha Kere (Karolinska Institutet)
• Various forms of collaboration
  – Shared personnel (Kismat Sood, Juha Muilu, visiting researchers Kenneth Lange and Joe Terwilliger)
  – Seminars etc.

Gene mapping
[Figure: haplotypes (~chromosomes) of 8 cases and 8 controls, one row per haplotype; columns are marker loci and entries are alleles.]

Gene mapping
[Figure: the same case and control haplotypes with example haplotype patterns marked (pattern 1, pattern 2).]

Highlights
• A novel method, Haplotype Pattern Mining (HPM), for gene mapping based on association analysis
  – search for haplotype patterns that are associated with the disease status
  – HPM is the first method to use the concept of patterns and to take an algorithmic approach to the problem
  – HPM has later been extended to cover a variety of different cases, and recently we showed how haplotype patterns can be found in genotype data and used for gene mapping without explicit haplotypes.
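
The pattern-based association idea behind HPM can be illustrated with a small sketch. It is not the HPM implementation: it assumes contiguous patterns only, a plain 2x2 chi-square statistic, and a marker score equal to the number of case-associated patterns spanning that marker. The function names, the threshold 3.84 (the 5% critical value for one degree of freedom) and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def marker_scores(haplotypes, is_case, max_len=3, threshold=3.84):
    """For each marker, count contiguous patterns over-represented in cases."""
    n_markers = len(haplotypes[0])
    n_cases = sum(is_case)
    n_controls = len(is_case) - n_cases

    # (start marker, alleles) -> [count in cases, count in controls]
    counts = defaultdict(lambda: [0, 0])
    for hap, case in zip(haplotypes, is_case):
        for start in range(n_markers):
            for length in range(1, max_len + 1):
                if start + length > n_markers:
                    break
                key = (start, tuple(hap[start:start + length]))
                counts[key][0 if case else 1] += 1

    scores = [0] * n_markers
    for (start, alleles), (in_cases, in_controls) in counts.items():
        stat = chi_square_2x2(in_cases, in_controls,
                              n_cases - in_cases, n_controls - in_controls)
        # Keep only patterns that are more frequent among cases.
        positive = in_cases * n_controls > in_controls * n_cases
        if stat >= threshold and positive:
            for marker in range(start, start + len(alleles)):
                scores[marker] += 1
    return scores


# Toy data (hypothetical): every case carries alleles 3, 4 at markers 1 and 2.
haplotypes = [
    [1, 3, 4, 2], [2, 3, 4, 1], [1, 3, 4, 1], [2, 3, 4, 2],  # cases
    [1, 1, 2, 2], [2, 2, 1, 1], [1, 2, 2, 1], [2, 1, 1, 2],  # controls
]
is_case = [True] * 4 + [False] * 4

scores = marker_scores(haplotypes, is_case)
print(scores)  # [0, 2, 2, 0]
```

In this toy example the highest scores fall on markers 1 and 2, where the shared pattern sits; a gene-location guess would be the marker with the highest score.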

Highlights (continued)
• TreeDT method for association analysis
  – TreeDT looks for tree structured haplotype patterns
  – the patterns reflect possible recombination histories of genes in the population
  – gene localization is based on the most plausible of such histories
• Oligogenic models for binary traits
  – Bayesian inference using recurrence risk data and Markov chain Monte Carlo (MCMC) simulation methods

Highlights (continued)
• A new method for haplotyping
  – Problem: given a set of remotely related genotypes (unordered allele pairs from a pair of haplotypes), reconstruct plausible haplotypes
  – The new method is based on Markov chains of variable order, and applicable to genetically larger regions than previous methods
• Populus simulator
  – A very useful tool for method development in population genetic studies

Proposed research directions (1/2)
• Trend towards higher efficiency in wet labs:
  – high-throughput genotyping technologies
  – high-density marker maps of biallelic SNP markers
  – bigger sample sizes
• Research issues
  – scalability of computational methods
      • number of data points, number of dimensions
  – effective utilization of this richer information
      • e.g. haplotype block structure

Proposed research directions (2/2)
• Goal of decreasing per-genotype laboratory expenses with computational techniques
  – Pooling of DNA samples from several individuals and genotyping the pooled samples
      • how to recover haplotypes from pools?
      • how much information is lost in pooling?
      • how to design pooling studies?
  – Selecting a subset of the available markers for a future study or a diagnostic test
      • determine a small set of markers sufficient to reliably identify the whole haplotype

Conclusions
• Common characteristics of the tasks:
  – combinatorics
  – stochastic data
  – optimization problems, non-obvious objective functions
  – conserved patterns due to shared genetic origin

Adaptive Computing Systems
Patrik Floréen, HIIT SAB meeting 17.10.2003

Outline
• Introduction
• Present activities
• Future activities

What is Adaptive Computing?
• Adaptive computing refers to solutions that adapt to their environment
• Linked to the ubicomp / pervasive computing / proactive computing vision
• We focus on some central topics to realise this vision:
  – Context-awareness and adaptation is central to user-friendly ubicomp applications and ad hoc networking may in the future provide infrastructure for many ubicomp applications

Fundamentals
• Draws on existing competence in data mining, probabilistic reasoning, algorithmics and language technology
• At the crossroads of many of the research groups of HIIT: many of our research groups deal with context-awareness, personalisation and adaptation
• This presentation mainly about the ACS group at BRU
  – 8 persons
  – New! Started this year

Present projects: CONTEXT
• Nov 2002-Dec 2005, Academy of Finland
• Hannu Toivonen, with ARU
• Characterization and analysis of information about user's context and its use in proactive adaptivity: what is the user's understanding of her current context, how to make automatic inferences about the contexts, and how to characterize context to users and design user interaction about contexts
• Research approach: qualitative end user studies, data analysis algorithm development, and empirical testing in a prototype environment
• Manuscript is in preparation. The user experience research (undertaken in ARU) has been reported at Intl Symp.
Human Factors in Telecommunication

Present projects: NAPS
• Jan 2003-Dec 2005, Academy of Finland
• Patrik Floréen, with HUT
• Fundamental topology control and routing problems in ad hoc networks
• Research approach: algorithmics, in particular graph-theoretic optimisation
• First topic studied is topology control: multicasting lifetime maximization under energy constraints
• Results presented in MobiCom 2003 workshop; journal paper in preparation
• Next directions: sensor networks

Present projects: NAPS results
• Model: Energy consumption function of transmission power (graph representation)
• Input: Node energies and power threshold graph
• Output: Optimal power assignment schedule
• Usually: min. energy & static power assignments
• Here: max. lifetime & dynamic power assignments
• Result 1: max. multicast lifetime & static: polynomial
• Result 2: max. multicast lifetime & dynamic (discrete time steps): NP-hard and APX-hard
• (As part of proof: certain Steiner tree packing problem proved NP-complete)
• Result 3: 2 heuristic algorithms and a method to calculate an upper bound; algorithms give good results

Present projects: Space4U
• July 2003-June 2004, Nokia, ITEA/EUREKA project subcontract
• Patrik Floréen
• Context-aware selection of software components in mobile phones
• Project just started

Present projects: PROACT coordination
• Jan 2002-May 2006, Academy of Finland (Tekes, French Ministry of Research)
• Program director Heikki Mannila, Coordinator Greger Lindén
• Coordination of the Research programme on Proactive Computing (14 projects with 41 partners)
  – Promoting collaboration between projects and to improve national and international contacts
  – Follow-up of projects and programme
  – Assisting in administrative procedures

Principles for building the future
• There is potential for expanding the activities
• Collaborative projects with other groups (also outside of HIIT)
• Attention to recruitment of postdocs
• Seek to engage in TEKES projects
• Optimal
size of research group is maybe 15 persons

Future research topics
• Continuation and expansion of work on context inference and reasoning, as well as continuation of work on fundamentals of ad hoc networking
• New research topics:
  – Context-aware personalised information retrieval
  – Potential of groups of users in the form of context sharing, group learning and social navigation
  – Distributed solutions, both algorithmically and from a software engineering point of view distributing the intelligence into the surroundings

Intelligent Systems
Henry Tirri

Recent achievements

People and collaboration
• Professor Henry Tirri, Dr. Wray Buntine, Dr. Jorma Rissanen, Dr. Petri Myllymäki and many excellent graduate students

General Goal
The aim of our research is fundamental understanding and development of computationally efficient probabilistic and information-theoretic modeling techniques, and their multi-disciplinary applications from engineering to sciences.

B-course (http://b-course.hiit.fi)

Some other recent highlights
• Application of the modeling to intelligent educational software: Outstanding Paper Award (SITE 2002)
• Honorable mention for being 2nd (out of 114 international groups) in the Knowledge Discovery and Databases prediction competition (pharmaceutical application) (MDL-solution)
• E-government: Modeling the voting behavior of Finnish Parliament members 1999-2002 (http://cosco.hiit.fi/eduskunta/index.jsp)
• Mobile Device positioning: “Manhattan trial 2003” - the most accurate GSM-phone urban positioning (<30m) available

Future

Next Generation Information Retrieval

Research emphasis
• PROSE: Probabilistic modeling based analysis engine and query processor; mathematical methods
• SIB: semantic-based information management integrated with personalization; “network appliance”; specialized interfaces
• ALVIS: distributed architecture; topic-specific gathering; sophisticated language processing

NGIR features
• semantic search built on automatically performed analysis,
not just human-tagged content
  – probabilistic modeling
  – “shallow” language analysis
• integrated personalization
• integrated collaboration
• “intelligent” interface
• distributed architecture
• topic-specific search

Aspects of “Search”

Theory

Theory topics
• Kolmogorov’s structure function interpretation of MDL (with related rate distortion theory)
• Model-based similarity metrics
• Normalized Maximum likelihood (i.e., stochastic complexity) for flexible graphical model families
• Computationally efficient models (mPCA, ICA) for LARGE-SCALE text retrieval

CoSCo papers

Applications

Multi-disciplinary applications
• Medical domains (“P-Course”)
• Some biology stuff (mPCA for genome modeling with Michael Jordan)
• Analysis software for social sciences (e.g., educational “qualitative” analysis)

Sensor networks
• “Grounded Web” = computing + communication + sensing
• applying mobile positioning & modeling research for self-organization and modeling
• Scaling challenge

Scientific impact
• Regular channels: publications, books, conferences etc.
• International networking (cooperation & recruiting, graduate education)
• Open source code releases (search, e-learning environments, modeling software)
• Publicly available servers (search, data analysis)
• Cool demonstrations :-)

HIIT Digital Economy (DE)
Jukka Kemppinen, Martti Mäntylä, Olli Pitkänen
HIIT ARU

Premises for the Research
• Digital Economy refers to legal, societal, and business issues that are specific to the network society
• The rapid development of information and communication technologies challenges traditional ways to structure, organize, analyze, and regulate the activities in a society:
  – Digital contents: rights, distribution, …
  – Balancing different stakeholders' interests: privacy, trust, …
  – Roadmap to future: whose future will prevail?
• The research interests of HIIT’s DE group are related to solving these problems

Present State of the Group
• DE group founded in 2000
  – Longer roots at Helsinki University of Technology
• About 18 researchers
  – Professor Jukka Kemppinen, leader
  – Professor Martti Mäntylä
  – Olli Pitkänen, program coordinator
• Currently four projects, funded by Tekes and companies:
  – DE Core, MobileIPR, STAMI, Welfare of Nations
  – The group’s strengths include especially issues in intellectual property rights, digital rights management, open source licensing, and security

Digital Economy 2003
[Diagram: partners UCB/ICSI, BCIS, UCB/SIMS. Projects: Welfare of Nations (network society study), MobileIPR (DRM & business models), DE Core (Structures of Digital Economy), STAMI (security technology, personal privacy). Project leads named: Kemppinen & Himanen, Kemppinen, Kemppinen & Mäntylä, Mäntylä.]

Competences
• Present disciplines in the group:
  – Information technology
  – Law
  – Political science
  – History
  – Philosophy
• More breadth may be needed - economics?
• Partnerships:
  – HUT, UH; Helsinki School of Economics, Lappeenranta University of Technology, Research Institute of the Finnish Economy, Bank of Finland

Recent Achievements
• Papers, journal articles, reports
• Working prototypes to study certain aspects of the future digital services
  – Digital rights management
  – Secure and accountable peer-to-peer content sharing
  – Micromovie sharing (under preparation)
• First International Mobile IPR Workshop: Rights Management of Information Products on the Mobile Internet, keynote speakers: Professor Hal Varian (UC Berkeley) and Professor Ross Anderson (Cambridge)

Future Research Directions (1/2)
• In the next few years, the DE group is going to focus on the following topics:
  – The structures of the network society, including value networks, transaction costs, relations between actors, pricing mechanisms, and business models (DE Core)
  – Digital goods, information products and services, rights in them, value of the rights, products and services, and rights
management (IPIS)
  – Open source as a modus operandi, development model, licensing models, economic analysis of copyright, incentives, and user rights (RIPOS)

Future Research Directions (2/2)
  – Privacy, trust, and economy in P2P content distribution (MUPPET)
  – Models of information and welfare societies, digital sociology and politics, international comparison esp. Chinese information society (WoN II)
  – Digital economy issues in manufacturing industries?
• Continued co-operation with University of California, Berkeley, School of Information Management and Systems (SIMS)
• New partnerships in Europe?

Digital Economy 2004
[Diagram: partners UCB/ICSI, BCIS, UCB/SIMS. Projects: IPIS (Intellectual Property in Information Society), Welfare of Nations II (Dynamics of Chinese Information Society), DE Core (Structures of Digital Economy), RIPOS (Risks and Prospects of Open Source), MUPPET (Privacy and Trust in P2P Communication). Project leads named: Kemppinen, Himanen, Kemppinen & Mäntylä, Kemppinen, Mäntylä.]