Internet2 innovation Working Group
Distributed Big Data & Analytics (DBDA)

Chairs call - June 11, 2015

Attendees:
Chairs - Alex Feltus (Clemson), Sam Gustman (USC), Marc Hoit (NC State)
Internet2 - Florence Hudson, Rick McMullen, Bob Brammer, Khalil Yazdi, Ann O’Beay, Giselle Trent

Opportunity for Internet2 to add value in Distributed Big Data & Analytics:
• Internet2 is good for moving data; consider extending it to support data analytics
• Internet2 is able to convene the community and communities-of-practice at scale, in the U.S. and globally
• Internet2 is able to bring services forward at scale and tuned to researcher needs, working with both Internet2 internal and external service providers
• Internet2 can work with the community to identify good models of collaboration across IT, research & libraries on campuses to handle the "data problem" (responding to agency requirements, IP control, compliance, etc.)

Challenges the Internet2 community and users face re DBDA:
• Hard to use the network: complicated sets of IT issues; researchers need a cookbook on how to use the network, get access to services, and use services
• Networking, nationally and globally, has many layers of challenges (including the problem of fast networks, slower networks, and slow end devices such as spinning disk)
• Data analytics and proximity of data and analytical tools
• Distributed data, analytics and network speeds
• Inclusion of librarians and those responsible for curation, archiving and preservation
• Researchers don't know what Internet2 is or how to access and leverage the network, including for their big data needs
• Current research data sharing is at "Excel" scale vs. the needed exascale; genomics needs gigascale
• Need to serve the "missing middle"
• End-to-end 100Gb connectivity internationally, or even domestically, is "impossible" ... today
• Even within a country or region, there are different layers of connectivity ... some at 100Gb, some at 1Gb, some slower, etc.
• Standards can take ~24 months or longer to bake, then uptake starts, based on input from the InCommon report
• Storage bottlenecks can be a challenge, a corollary to network bottlenecks
• User "pre-sales" and "sales" support is needed to help potential users prepare for Internet2 network connectivity, then use it most effectively
• Measuring TCP window sizing
• Distributed big data repositories and how to get the data through Internet2
• Need less dilution of case studies, in order to share core technical information
• Need cyber-infrastructure experts AND storage experts AND research scientists on the phone to set up a system that will work when you turn it on
• Need end-to-end performance monitoring, tuning and support
• Traffic shaping rules
• Identify network, storage and computing bottlenecks
• The network may be considered the speed problem, but the true bottleneck can be writing to local spinning disk
  • Especially when going over 3 or 4 international research networks (e.g. USC to Europe to Prague)
• Network troubleshooting
• End user infrastructure and support desk needed for network troubleshooting
• How best to aggregate data and transfer it
• Internet2 is interacting with CIO organizations, but not researchers
• Digital preservation of data
  • => make sure it's not physically rotting ... newer technology rots faster

Potential strategies to address DBDA challenges:
• Communicating what is available to researchers; they don’t know what they may have access to
• Establishing relationships between and among researchers and service providers (university and industry) around the particulars of use cases; need to develop a Community of Practice
• Establish community protocols for data access and sharing of data
• Managing data repositories
• Engage NSF
  • NSF big data hub
  • Regional focus, e.g. GIS in the southeastern US
• Identify a few potential use cases
  • Power grid monitoring use case, with geographically isolated sensors: petabytes of data needed for regional sensor data from power grid monitoring; near-real-time data aggregation; Internet2 is in a proposal as a partner for a southeastern US GIS regional project
  • Economically underserved researchers in the U.S. or Africa: they are not currently accessing large data sets and don't know the data is there or how to access it
  • Agriculture
  • Libraries
• Enable community sharing of core technical information and details of use case studies
• Create CORBA - Common Object Request Broker API to query across datasets
  • Include libraries, scientists and engineers
• Create the "Library of Things" API
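
Illustrative note on the "Measuring TCP window sizing" and bottleneck items listed under the challenges: the sketch below shows, under stated assumptions, why window size and slow end devices matter on long paths. It computes the bandwidth-delay product (the amount of data that must be in flight to fill a link), the throughput ceiling imposed by a fixed TCP window, and the slowest stage in a transfer path. The function names and all numbers (100 Gb/s link, 150 ms round-trip time, 64 KiB window, 200 MB/s disk) are hypothetical and were not discussed on the call.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bps * rtt_s / 8


def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on a single TCP flow's throughput for a given window size."""
    return window_bytes * 8 / rtt_s


def bottleneck(stages: dict[str, float]) -> tuple[str, float]:
    """Return the slowest stage (name, rate in bits/s) of an end-to-end path."""
    name = min(stages, key=stages.get)
    return name, stages[name]


if __name__ == "__main__":
    link_bps = 100e9   # hypothetical 100 Gb/s path
    rtt_s = 0.150      # hypothetical 150 ms intercontinental round trip

    print(f"Window needed to fill the link: {bdp_bytes(link_bps, rtt_s) / 1e9:.3f} GB")
    print(f"Ceiling with a 64 KiB window: {window_limited_bps(65536, rtt_s) / 1e6:.2f} Mb/s")

    # Hypothetical stage rates in bits/s; 200 MB/s spinning disk = 1.6 Gb/s.
    stage, rate = bottleneck({"network": link_bps, "campus LAN": 10e9, "disk write": 1.6e9})
    print(f"Slowest stage: {stage} at {rate / 1e9:.1f} Gb/s")
```

With these assumed numbers, the disk write stage rather than the network sets the end-to-end rate, which matches the observation that the true bottleneck can be writing to local spinning disk.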