Metadata Organization and Management for Globalization of Data Access with Onedata

Michał Wrzeszcz, Krzysztof Trzepla, Rafał Słota, Konrad Zemek, Tomasz Lichoń, Łukasz Opioła, Darin Nikolow, Łukasz Dutka, Renata Słota, Jacek Kitowski
ACC Cyfronet AGH
Department of Computer Science, AGH-UST
PPAM 2015, Krakow, Poland, September 6-9, 2015

Agenda
  - Motivation
  - Problems with Global Data Access
  - Is a new tool needed?
  - Onedata
      - Design Assumptions
      - Key Aspects of Data Access
          - Global data organization
          - Globally distributed metadata
  - Results
  - Conclusions

Motivation
  - Scientific communities require global access that integrates independently managed resources.
  - Metadata organization and management is the key to making global access effective, simple and convenient.

Problems with Global Data Access
  - Storage heterogeneity and delay/bandwidth issues.
  - No account integration: difficult access (security issues).
  - Manual transfer of data before/after computations.
  - Problematic data sharing.

Is a new tool needed?
  - Globus Connect, iRODS, PanFS, Parrot, GoogleDrive, Gluster, LFC, Dropbox, BeeFS

Onedata - Design Assumptions
  - All organizations (providers) supporting a user have access to all data and metadata concerning that user.
  - No central metadata server, for the sake of performance and availability.
  - No replication of everything to everyone; data redundancy is managed optimally.
  - Data access efficiency:
      - Minimal overhead when the data is close to the client.
      - Efficient fragment access when the data is remote.

Onedata - Key Aspects of Data Access
  - Global data organization
      - Hides the complexity of data distribution from users
      - Indicates which remote data should be observed by each organization
  - Globally distributed metadata
      - No trust between providers
      - Caching vs. coherency

Global data organization
  - Easy management and sharing of data for users.
  - Limits the metadata that each provider needs to know.
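The idea that each provider observes only the remote data relevant to it can be sketched as follows. This is a minimal illustration, not Onedata's implementation: the class `Registry`, its methods, and the provider and space names are all hypothetical. The only thing taken from the talk is the notion of a space supported by a subset of providers.

```python
from collections import defaultdict


class Registry:
    """Hypothetical mapping of spaces to the providers that support them."""

    def __init__(self):
        # space name -> set of provider names (illustrative structure only)
        self._supporters = defaultdict(set)

    def support(self, space, provider):
        """Record that `provider` supports `space`."""
        self._supporters[space].add(provider)

    def providers_to_notify(self, space, origin):
        """Providers that should observe a change to `space` made at `origin`."""
        return sorted(self._supporters.get(space, set()) - {origin})

    def observed_spaces(self, provider):
        """Spaces whose metadata `provider` needs to know about."""
        return sorted(
            space
            for space, providers in self._supporters.items()
            if provider in providers
        )


registry = Registry()
registry.support("climate", "provider_a")
registry.support("climate", "provider_b")
registry.support("genomics", "provider_b")
registry.support("genomics", "provider_c")

# A change made at provider_a only concerns the other supporters of the space.
print(registry.providers_to_notify("climate", "provider_a"))
# provider_c never needs metadata for spaces it does not support.
print(registry.observed_spaces("provider_c"))
```

In this sketch a provider's metadata load grows with the number of spaces it supports, not with the total amount of data in the system, which is the limitation the slide refers to.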

Global metadata distribution
  - 3 metadata levels
      - Metadata used to coordinate providers' cooperation
      - File metadata stored by each provider
      - Current usage metadata
  - Usage optimization
      - Lower level -> more frequent usage -> higher distribution
      - Caching and aggregation of changes
      - Changes pushed to caches

Global metadata distribution - Level 1
  - Supports cooperation (user account integration)
  - Indicates which lower-level metadata should be synchronized with whom (space metadata)
  - Stored by the Global Registry, a distributed application that works as a trusted mediator

Global metadata distribution - Level 2
  - File metadata
      - Description of file part locations
  - Stored by each provider that supports a particular space
      - Fast access to needed metadata
      - Limited number of synchronization operations
  - Propagation of changes on the basis of Level 1 metadata
      - Aggregation of changes
      - Automatic conflict resolution
      - Level 1 metadata caching

Global metadata distribution - Level 3
  - Metadata about current file usage
      - Who should be notified about a file change
      - Where data is currently being modified
  - Stored by providers, cached by clients
  - First aggregation on the client side, second on the provider side
  - Updates Level 2 metadata

Global metadata distribution - Sum up
  - More changes -> lower level -> more power
  - Caching & aggregation vs. time needed to gain global consistency
      - Balance set at the provider level (dynamic client reconfiguration)
      - Locks for immediate consistency
  [Diagram: Global Registry (Level 1); Provider 1 and Provider 2, each with a Level 1 cache, Level 2 and Level 3; Client with a Level 3 cache]

Results - Simplicity
  - Easy organization of data
  - Global distribution hidden
  - Easy publishing of results

Results - Cooperation

Results - Efficiency

Conclusions
  - Data organization hides global distribution from users while preserving providers' independence
  - Ready for global user cooperation
  - Efficient enough for computations
  - Onedata status
      - Onedata v1 is installed in the production environment of ACC Cyfronet AGH
      - Onedata v2 is currently being tested by international organizations

Thank you
  onedata homepage: http://www.onedata.org