>> Alex Wade: All right. Well let’s go ahead and get started here. Welcome everybody in the room and those of
you online. My name is Alex Wade with MSR Outreach and it’s my pleasure to welcome today Arfon Smith.
Arfon has had a very interesting career going from being a research scientist to actually getting into GitHub where
he is now, but prior to that he was involved with the Zooniverse platform and was one of the leaders in sort of the
crowd source software space. And now he is turning his attention more fully to how software can be used really
within the research community. So it’s my pleasure to welcome him here today and thank you Arfon.
>> Arfon Smith: Thanks, thanks for having me here today. I also need to apologize immediately for the most stupid
talk title ever. It seemed brilliant when I wrote it, but it only probably means something to me. So the idea being
that these are variables that you can substitute. So we will do that a bit later on. Anyway, yeah, so my background
is research I have a PhD in astrophysics. I very quickly realized --.
>>: [indiscernible].
>> Arfon Smith: What’s that?
>>: My son’s an astrophysicist.
>> Arfon Smith: Oh, really? Awesome, good. But it turns out I am a terrible astronomer. So I then had a year in a
group writing software at a bioinformatics institute, which was fun, and then yes, as Alex said, I was doing stuff with
Zooniverse. So I guess all through my career I have been writing software for research domains, initially for myself
during my PhD, but then later more to facilitate the research of others. So a lot of what we are going to talk about
today is sort of, I guess, influenced by those experiences. So yeah, my name is Arfon Smith, you can find me online
as Arfon on GitHub and elsewhere on the web.
I work for this company called GitHub which I think most of you probably know about, but if you don’t, it’s pretty
much the largest source of open source software on the web. So when I try and explain what GitHub is to people I
have sort of got a bunch of versions, and I just had a meeting with half of you in the room, so you guys are probably the
least in need of explaining this, but it’s this distributed version control, we will only spend a couple of slides on it,
that helps with versioning stuff. This is how I explain GitHub to my mum or to somebody non-technical, because
we have all done this even though Track Changes exists, for some reason, and I realize I am on the Redmond campus now
so I am going to keep my rhetoric down, but Track Changes has never worked for me beyond about three authors.
People turn it off, or ignore it, or I don’t know, they just... you end up doing this and your version control is
your inbox, which obviously sucks.
And of course, as you are probably aware, we can’t solve the problem very well for Word because line
[indiscernible] are not very well versioned in Git as a technology. But this is a mathematics book that was written by
about 50 people on GitHub: 47 contributors, 3,000 commits. This was written by a whole bunch
of authors working in mathematics and writing in TeX, and this is the equivalent of Track Changes on GitHub: a single
commit showing line additions, line removals and line-by-line changes.
So there is a whole bunch of stuff that’s happening on GitHub. This is very small: if you are writing LaTeX then
this is a file that’s useful to you when using Git, the hidden file that tells Git what to ignore in a project. So
LaTeX spits out all these temporary files left, right and center, and you probably don’t want to version those
as you go. So this is a .gitignore file that’s actually got a whole bunch of people both forking it and starring it. So
people are kind of getting value out of this. And then at the very large scale this is a 7 million line code base. It’s C,
and Python and all sorts of other things, but it’s the production software pipeline for the CMS experiment on the
Large Hadron Collider.
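As a sketch of what such a file might contain (the exact entries depend on your LaTeX toolchain, so these lines are illustrative rather than the specific file shown in the talk), a minimal .gitignore for a LaTeX project could look like:

```
# LaTeX build artifacts -- regenerated on every compile, so not worth versioning
*.aux
*.log
*.out
*.toc
*.bbl
*.blg
*.synctex.gz
```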
So this is one of the two experiments that detected the [indiscernible]. This is a group that is active on GitHub.
They collaborate on GitHub and there are, I forget how many authors there are, but I think there are about 300
authors on this project. They are PhD students in labs and they are building the software that produces the data.
It’s not the science part of the experiment, it’s the kind of production, data release part that manages the
instrument. That’s all happening on GitHub.
So, from the very, very small to the very, very big. I also sometimes have to do this when I speak, which is explain
what researchers do and why I think GitHub is relevant. So I have this kind of story that I like to tell. The
short version is that astrophysics is technical, people are smart, they are solving hard problems, but their research
workflows are brimming with inefficiencies. They are wasting vast amounts of time, and this is not because
astronomers are silly people, but because they don’t know about technologies that we know about in the open
source world. So astrophysics is... it should be better lighting, but never mind, this is a very nice projector. This is a
telescope on the top of Mauna Kea, and astrophysics is going to wonderful places, using amazing instruments,
seeing the sky like most people don’t. And using this: this is the last telescope that I used, the Anglo-Australian
Telescope in Australia.
So it’s big science, big hardware, but the technology actually looks like this. This is the control system for
that telescope; I think there is VAX hardware underneath this somewhere that is actually in the
production pipeline for data off the telescope. So you know, those instruments are the absolute pinnacle of engineering
effort, and the computational stuff is sometimes behind. And also this is the kind of
rarely seen side of astrophysics. This person, whose name is Steve [indiscernible], is so tired that he has got a Post-it
note on his face telling him to remind me, to remind him, to change something about the configuration of the
hardware, because it’s 3:00am and we have been awake for 27 hours and we are jet lagged. So it’s not that
glamorous.
The point about data reduction, and this is a common pattern
in research, is that your job as a researcher is initially to remove the effect of the instrument you are
using from the data you have collected, because you want to be able to compare data sets from multiple different
instruments. So the detectors that are used in telescopes are typically CCDs. This is what they look like when you
illuminate them, and you can see these are quite big physical devices. They are probably about this large,
somewhat larger than the CCD in your phone. But you can see they have got all sorts of weird fringing effects,
which are internal reflections that go on inside the device.
And when you zoom in on them and look at them up close... because they are quite expensive devices,
astronomers sometimes buy cheap versions of these things, because your telescope is really expensive so you are
cutting costs on the CCD, and you get these stuck pixels, these white ones, called hot pixels. These are pixels
that are not actually detecting light; they are just stuck reading out 100 percent all the time. So you can see that
if we scroll down the CCD there is a hot column that runs top to bottom, all the way up and down the
CCD. And if you look further you see that there are actually lots of these small stuck pixels all across
the device. So there are these pixels in your data set that you want to remove, and this is what a researcher does
when they first get their data from the telescope.
So you open up this viewer and you look at these files, because some of them are really obvious, but some of them
are really subtle. Like this one: you will see that not only is it not stuck on 100 percent, but the response function
has gone a bit weird on the CCD. So you end up making this thing called a bad_pix_mask. And it’s plain text and it
looks something like this. It describes X1, X2, Y1, Y2 positions for all the crappy data.
Yes?
>>: Is there one of those per telescope?
>> Arfon Smith: Per detector, yes.
>>: So like if you are new to the telescope you don’t have to go and build it for yourself every time? Like
somebody did it like 20 years ago when they first turned it on?
>> Arfon Smith: Wouldn’t that be wonderful, wouldn’t that be amazing if that happened? No, I am about to tell you
what actually happens, which is that everybody does this every time. So that workflow, as I said, is brimming with
inefficiencies. That detector is expensive; you probably only get about 3 nights of use on it, with maybe 2 or 3
observing groups using it every week. It takes
about 2 days when you first look at a new detector, if you have never done that task before, as a new PhD student
like I was, to really get a canonical bad_pix_mask. That detector will probably be on the
telescope for 15 years, and so there is about 13 years of human effort, ballpark, wasted doing that task, and nobody
shares these things. So these are the most shareable --.
>>: Like no one was like, “Hey you used it Wednesday; how about you give me what you did.”
>> Arfon Smith: Nobody, nope.
>>: [indiscernible].
>>: There is something meditative about doing the same thing over and over.
[laughter]
>> Arfon Smith: No, but there is this idea as well that there is this rite of passage, right? Everyone
has got to make a bad_pix_mask. And you know, when you have used the telescope again you probably use
your old one, but you have a quick look. The problem as well is that they are not absolutely static; they evolve,
right? People drop things into the instrument, or somebody spills coffee, or something. They are not so static that
the person who used the telescope 2 weeks ago could just share theirs, but you probably wouldn’t want the
one from 3 years ago. You would probably want to make a new one, or at least want to version the thing,
and then we could all iterate and collaborate around it.
So the most extreme use case here, and I am being deliberately extreme, is that it is perfectly feasible that
you might spend a decade of human effort writing these things. And I would like to say that this is unusual, but I
think it is normal actually in research: this kind of inefficiency where the only research products we
really ever think about are the papers at the end. We don’t think about the products of our
research as we go. So this is a deliberately extreme example, but I think if ever there was a use case for the
most human time wasted on a single file it would probably be the bad_pix_mask, because years per byte is the
unit, I think.
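To make that concrete, here is a minimal sketch of how such a plain-text mask could be parsed and applied. The exact file layout (one whitespace-separated x1 x2 y1 y2 region per line) and the function names are my assumptions for illustration, not an actual astronomy tool:

```python
def parse_bad_pix_mask(text):
    """Parse a plain-text bad pixel mask: one 'x1 x2 y1 y2' region per line.

    Blank lines and '#' comments are skipped. Returns a list of
    (x1, x2, y1, y2) tuples with inclusive pixel bounds.
    """
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        x1, x2, y1, y2 = (int(v) for v in line.split())
        regions.append((x1, x2, y1, y2))
    return regions


def apply_mask(image, regions, fill=None):
    """Return a copy of `image` (a list of rows) with bad regions set to `fill`.

    In practice you would flag these pixels rather than overwrite them,
    but this shows the idea: the mask marks detector coordinates to ignore.
    """
    out = [row[:] for row in image]
    for x1, x2, y1, y2 in regions:
        for y in range(y1, y2 + 1):
            for x in range(x1, x2 + 1):
                out[y][x] = fill
    return out
```

The payoff of versioning a file like this is exactly the collaboration story above: a one-line diff per newly discovered hot pixel, instead of every observer rebuilding the whole mask from scratch.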
Anyway, so how do open source communities deal with problems like this? How do they collaborate
around stuff? Well, the first thing I wanted to do, because I think that this is sort of an experienced audience, is
to draw a slight distinction between open source and open collaborations, because I think there is a subtlety
there that matters, especially in academia. Very broadly and hand-wavy, and I am not a licensing guru or legally
trained, but open source is about a right to modify the underlying thing that you have stumbled across. You know,
there is some kind of contract there, confidence that you can take what’s been published and you have
a right to modify and reuse that work. But with an open source license there is nothing there
saying that you can contribute back to that project. That is, there is no right to contribute.
So this is the kind of difference between open source and open collaborations, and this is where I think
academia can learn a lot from what happens on GitHub. Open
collaborations are often also open source projects, the products of these communities usually are, but the idea is that
anyone who is competent, if they make a valid contribution, might have a reasonable expectation of that
contribution going back into the project.
>>: I mean if Solomon were here though he would be hopping up and down saying that the whole intent of
copyleft was that, you know, if I fork from you and then did some additions then I had to contribute it back. So the
intent was --.
>> Arfon Smith: So I might have to make it available, but would it go back into the canonical source?
>>: True.
>> Arfon Smith: So that’s kind of what I am saying. If I stumble across a library that has got an open source
license on it I might do some amazing work, but the --.
>>: It’s great for forking, but not for [indiscernible]?
>> Arfon Smith: Right, so maintainers might have no interest in taking contributions back.
>>: And you have to develop a whole community around them.
>> Arfon Smith: Right, and so it’s this kind of open collaboration that --. So I guess today I want to sort of talk
more about how academics, or the academic model, could work well for open collaborations and also in open source
and open science.
So, comparing and contrasting with that situation I just described with the bad_pix_mask, what’s different?
Well, in open source there is this culture of reuse, right? There is no shame in picking up the work of someone
else and building upon it. In fact this is core to the expectations, the norms of the
community. But there is something more than that, something that academics really don’t have, which is
that the environments in which open source software is often built are actually lower friction than academic
environments. E-mailing Word documents or sitting in a room together are not models that scale when you have got
something like 4,000 people working together on a project.
So the piece of GitHub I think is most interesting, and this is I guess my sales pitch, is what happens on pull
requests. And I think this is pretty much the only GitHub feature that I am going to talk about today.
And I think a really good example of this, and again sorry, some of these slides are probably a little bit inappropriate
for this audience, is a project called Homebrew. It’s not inappropriate
because of what I am about to say; it’s a thing that you put on your Mac and it allows you to get some software
dependency on your computer. So if you want MySQL, or some C library installed, or some kind of
weird [indiscernible] bit of software, then there is this package management thing called Homebrew, and it is an
open source project. It has got 3,700 contributors. So, how the hell does that work?
Well, the way it works is that a whole bunch of people take copies of this by forking on GitHub. So when I click that
fork button on GitHub, if you haven’t done it, you basically get asked where you want to take a copy of this code
base to. I exist as an individual on GitHub, and I am in a few organizations that
belong to GitHub, these ones, but I have also interacted with most [indiscernible] science labs, so I am on some of
their projects as well. But I basically take my own copy. We call this forking: we fork the code and you
bring it into your namespace on GitHub. So this is effectively a namespaced branch of the code base. And then I
just go away and I write my new formula that goes into Homebrew. This is where I go and do my work, and this has
happened tens of thousands of times on GitHub.
I was going to see if we can see the number of pull requests on Homebrew, but this happens an enormous amount,
actually 28,000 there, nearly 28,700. The important point about this, though, is that I can do my coding first and then
I can ask for permission later. So if I am interested in making a contribution I don’t have to go and ask for
permissions on the repository. I don’t have to ask permission to become a contributor. I can fork the code
base, go off and do the work and lead with examples. So: code first, permission later, and I think this certainly
breaks down a barrier when trying to advocate for a change. You can actually go and make that change before you
have to advocate for the option to make the contribution. You can go and do the work and lead by example. And I
wanted to prove that GitHub eats its own dog food and uses GitHub to build GitHub.
So this is github/github, the code base internally, and this is, I think, our longest ever running pull request, or
most commented pull request. And it was by somebody who wanted to change the about page. This is a
screen grab of that page and you can see the initial code contributions at the top. This is a design change and
a functionality change; you can see a whole bunch of people commenting and contributing.
These are code additions here, comments, new images, more code, more code, and then when you get to the bottom
of the page you can see that PJ Hyett merged the commit to master at the end. That merge to master was actually
the change going live in production.
So pull requests, merging pull requests, are the core part of how people collaborate on GitHub. But I think
one of the things that is interesting to me, when you have projects that are open, is: what does the collaboration
look like? What is the actual difference between open source and open collaborations? What is the expectation
of a contributor, or potential contributor, that their changes will go back into the library that they are interested in?
So this is a project called Redis, an in-memory key-value store. It’s a bit like memcached; it’s a very
popular caching layer for web applications. And this is time across the bottom here. We have got about three
years of data, and on the Y axis we have got the fraction of pull requests that get merged for a given month. The
size of the dot here is how many pull requests were opened that month, and what we are looking at here is that
in this month there were only a couple of pull requests, but they all got merged in. But over time, well,
I am an astronomer so I see a trend there, but we can argue about that if you don’t, you can see that basically this
project is getting more popular, these dots are getting bigger, but the fraction of pull requests
that are getting merged in is getting smaller.
Compare that with something like Ember, which is a pretty popular JavaScript web framework, and this is a wildly
popular project. This is only about two years’ worth of data, in fact a little bit less than that. This is a
project that is growing in popularity, and it is a more open collaboration than Redis. They are both MIT
licensed open source projects, but one is actually taking a large number of community contributions, and I think
this is important. And this is actually a library in astrophysics, a very popular project called Astropy, and again you
can see the project is getting a lot more popular. There is about two years’ worth of data here; there is a bad month
at the end here, but ignore that, we will rerun the data. They are pretty good at taking contributions.
But sometimes things like this happen. So this is Homebrew. It is written by a British guy who thinks that the
word “formulae” is cool, and Americans get upset with that because Americans say formulas, and occasionally you
see 63 files changed. This is somebody saying, “Actually it should be formulas”. I know one of
the maintainers of Homebrew, and about once a month somebody opens a pull request, probably just a find
and replace, saying they want to rename formulae, and that never
goes well for them because most of the maintainers are British and they don’t care.
So there is a governance thing here when you have 4,000 people contributing to code, but at the same time somehow
these open collaborations manage to create great net worth for the community. There is a really, really wonderful
blog post, if you are interested in open source and academia, by a guy called Fernando Perez; he is one of the
lead engineers on the IPython notebook, if you are familiar with that. He is at Berkeley and he has this well
argued thing that in academic environments today, papers are the only thing that
gets credited.
There is no real incentive to share your research products other than your papers, because there is no credit for them
as a researcher. He also has this idea that open source projects, the ones that are distributed, with people working in
different languages, are reproducible by necessity. There is no way you can collaborate with 100 other people if you
cannot quickly evaluate somebody’s work and understand the contribution they made and whether it’s a good
contribution or a bad contribution. So the way that these communities work is more reproducible, and one that
academics can learn from.
People who collaborate well on GitHub are better at collaborating because they have to be: they are often not
in the same room, they probably don’t know each other, they might not speak the same
language, but they have to find a way of working together or otherwise their project just wouldn’t grow. So what’s
the thing that open source brings to academia? Well, does open have to mean public? I think this is a really
important question, one Alex and I have been bouncing around today. Is this kind of way
of working a sort of call to action for an open science way of working? Do we have to share everything we
are doing as we go? I don’t think it does; I think there are different stages of openness:
openness within your team, within your department, within your institution, or maybe with the whole world.
But the principles are the same: what you are doing should be electronic. This is not meatspace-based
communication, but something that’s electronic and available to everybody. Preferably, and I think this is
something where GitHub does actually give you something, with the branching we were just talking about before: the
idea that people can go off in different directions, do their work on a feature branch, and then the technology is
actually helping you with merging back in later. But the process as you are working is exposed, and I think that’s a
really important principle that’s immediately transferable to academics. So again, multiple people working on the
same thing; this is a lock-free environment. Overall we are talking about the kind of low friction collaboration that
open source has and that academics typically don’t.
But this isn’t completely true. There are some interesting things happening in academia today. And I
wanted to just take a chance to talk through some good examples, I think, of open source
collaboration on GitHub. So this first one is sort of very small. This is a guy called Dan Foreman-Mackey;
he is an astrophysics PhD student at NYU and he has written this really nice sampling library that
is up on GitHub as a Python framework. His initial release was for his own work, I think he was doing some
exoplanet research, about a year or so ago now, and he has had a
reasonable number, we are talking about 100 now, of pull requests, with people doing really useful things. So
stuff that was not on some published roadmap, but that has value to other folks. There was initially a pull
request to make the library work well in an MPI environment, and now
somebody is actually optimizing the task allocation in that environment. So this is a library, a piece of
software, that’s growing in terms of its function and use for the wider community, all through pull requests
from people that Dan doesn’t immediately know.
This is something that is sort of lab-scale, I guess. This is Titus Brown at Michigan State University, and this
is a library that does DNA analysis. I think it’s looking at k-mers or whatever, I don’t know what
that term is, I think it’s some kind of inconsistency in the DNA structure, but this is a piece of software that
the lab is publishing. This is a research group, probably 10 or 15 people, working on it. And then you
have got the full scale, and it is no accident that this is a division of the NIH: this is a whole
institute, the NCIP, publishing all of their software as a curated list. They have got, yeah, 160
public repositories up on GitHub, everything they do is public, and these are the kinds of core software that
they want people to use. This is a kind of curated set, and that’s software.
But there’s really interesting stuff happening in the collaborative authoring environment too. I started
with that math textbook, but this is a wonderful example: the SciPy conference proceedings. So if you
want to present at SciPy... actually this is the master branch, but I think there is a
2012/2013/2014 branch. If you want to submit a paper to SciPy you fork the repository and you make a
pull request with your paper. The review happens in the pull request, and acceptance to SciPy is
being merged back into the branch on the SciPy organization. So I think that’s pretty neat.
And then there is a bunch of stuff happening in teaching actually. I think the IPython notebook is a really
significant advance in this space. This is a scientific Python lecture series. I think it’s worth noting that
starring on GitHub is a bit like bookmarking. You know, 250 people have bookmarked this, but 59 folks --.
You’ve got a question, sir?
>>: It’s sort of a high level question. As you get to more complex environments or [inaudible]. I mean
when I was an undergraduate, to turn in your first computer science homework assignment you had to use a
special version of [inaudible] and [inaudible].
>> Arfon Smith: Right.
>>: That got rid of 200 people out of a 600 person class because they couldn’t figure out how to turn in
their homework.
>> Arfon Smith: Right, that was an idea. It was sort of an unfair way to weed people out.
>>: [inaudible].
>> Arfon Smith: Yeah, so I think training is a huge factor in adoption. There is actually a really good
training program called Software Carpentry, which you might have heard of, run by the Mozilla Science Lab
these days, which sort of takes on this challenge. You are a new undergraduate or postgraduate and
somebody says Python and you are like, “What is that?” And you go and learn both how to
structure code and document code, and one of the core components of that is versioning, and GitHub
in fact actually.
>>: On the other hand you can argue that if you are going to be a scientist and you are not [inaudible]. I
mean there are a number of scientists who basically probably lost data and who can’t show it. So 20 years
from now I think people will be like, “Yeah”, but if these people are going to be scientists shouldn’t they be
learning how to work with their data? You can’t just write in a journal anymore.
>>: In 20 years, if you are a professor now, and you need to turn in a paper, you
know the GitHub method, but with the new thing that’s out there you would be like, “What the heck? Student,
can you submit the paper for me?” And not try and deal with it. I am sure my professor can program in
[inaudible], but [inaudible].
>>: But part of these [indiscernible] talks is teaching people the skills they need for the jobs that don’t
exist yet, in the sense of pushing the state of the art so that people are being trained for the next
generation. So aren’t we ready for this?
>>: [inaudible]. There is a subtle aspect of that workflow that I am not familiar with how it would work on
GitHub, which is the submission is now visible to everyone else.
>> Arfon Smith: Sure.
>>: So my homework should not be readable by all my classmates, and the paper might not, for every
conference, be visible to all the other submitters until it’s been merged. So is it just a workflow assumption
that it’s okay for it to be visible in those contexts?
>> Arfon Smith: So in teaching, the way that people use it is with a private repository typically, and so the
student will fork the professor’s repository and then submit a pull request, which is private because the
repository is private.
>>: [inaudible].
>> Arfon Smith: They are doing auto-scoring. So when they get the pull request they have got some stat for
how they have scored it.
>>: [inaudible].
>> Arfon Smith: Yeah, but for SciPy this is perfectly fine. And I think there are communities
now... the [indiscernible] library I showed before, every time Titus Brown submits a grant proposal he
publishes it on submission. He’s like, “I don’t know if I am going to get this money, but here’s the million I
asked the NSF for.” Yeah, there are people who are just okay with that. I am not arguing
that everybody is ready for that, but some people are doing it.
So these Python lectures look like this when they are in their executable form, or their rendered form
I should say, and this is a kind of growing space for us. But I feel like I have to explain this stupid title, so
I will come back to it. That was the set up, I guess; now I am going to do some variable substitution. There
is a telescope called the LSST, which some of you may have heard of. This is the Large Synoptic Survey
Telescope and it is the evolution of the Sloan Digital Sky Survey, which some of you will be familiar with.
The SDSS was something that Jim Gray had a lot of influence over, with Alex [indiscernible] at Johns
Hopkins University. So LSST is the next generation of sky survey telescope. I should show you a
picture of it; it’s incredibly badass looking. This is about an 8 meter mirror here. There is a version of an
EU-sized rendered person in it and it would be about this big here. I didn’t realize --.
>>: It’s EU size?
>> Arfon Smith: Yeah, there is an official EU size probably. There are regulations for everything, aren’t there?
But this thing is going to scan the visible sky, all the sky it can see, every three nights. The Sloan
survey took, I think, a decade to image about a quarter of the sky. This thing
is going to image all it can see every three nights. It’s basically going to do video astronomy. And it’s
going to have two core data products. It’s going to have a continual stream, they call it level 1 data, of
things that look like this. When you are imaging the sky every night, sometimes things are going to have
changed. You are going to look at the sky and say, “Huh, there is a new thing there, it wasn’t there the
night before”, and that might be a low flying rock, like maybe an asteroid, or it might be a supernova,
or it might be a gamma-ray burst. They call these things transients: things that are changing in the sky.
So it’s going to shift, it’s going to be the data string 10 to the something; you know that could be 5 or 8.
The number you get kind of depends on who you talk to. There’s going to be probably hundreds of
millions of events coming off the data pipeline every night live streaming and it’s going to ship with a
small postage sized stamp image of the sky with a crosshair saying, “That thing and here’s some basic
information about it.” And then it’s up to the community to go and point their telescopes at it. And then
there is going to be the level 2 data. So this is like the data release, a yearly cadence, big data releases, all
the imaging that’s done over that year with a reduced data product, a lot more effort going into the data.
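To make the level 1 idea concrete, here is a hypothetical sketch of what one of those transient alerts might look like as a data structure. The field names here are purely illustrative; they are not the actual LSST alert schema, just the kind of basic information a postage-stamp alert would carry.

```ruby
require "json"

# Illustrative only: a made-up transient alert payload, not the real LSST schema.
alert = {
  id:        "alert-0001",
  ra_deg:    150.1192,                # right ascension of the new source, in degrees
  dec_deg:   2.2058,                  # declination, in degrees
  mjd:       60310.25,                # time of detection (Modified Julian Date)
  magnitude: 19.7,                    # apparent brightness in the detection band
  cutout:    "stamps/alert-0001.png"  # the small postage-stamp image around the source
}

# A follow-up telescope would consume something like this from the live stream.
puts JSON.pretty_generate(alert)
```

The point is only that each event ships with coordinates, a timestamp, and a cutout image, and the community decides what to point at it.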
This is a telescope that’s inherently open because of the way it’s being funded. It’s jointly funded
between the Department of Energy and the NSF and [indiscernible]. It’s kind of expensive, it’s like half a billion
dollars or something. So it’s inherently open because of its data policy. It’s got kind of an open data
mandate. Any of us in the US can get all of the data if we want it, if we can take a copy of it. If you are
outside the US I think they are encouraging research institutes to kind of buy a license to access the data for
something like $20,000 per year per institution.
So within the US it’s inherently a kind of open data project. And actually moreover there’s actually a
really cool opportunity: when you are spending half a billion dollars building a telescope, people like the
NSF and [indiscernible] want you to think quite carefully about what type of instrument you are going to
build, which is fair. So they set up all these working groups who were interested in different science cases.
So you have got people who are interested in supernova, and galaxies, and the large scale structure of the
universe, dark energy and AGN. So each of these groups have been spending about the last 7 or 8 years
sitting down regularly 3 to 4 times a year saying, “Well what kind of science do we want to do? And what
kind of instrument would we need to build to have our science served best?”
So there is lots of science that’s possible with LSST, but these groups have basically informed the design of
the instrument. So they have been working together for close to a decade. So they have helped to develop
the telescope and I think there is an opportunity for groups like this to do more. So the question is: Where
do the communities form? And certainly with LSST it’s around a shared challenge. They are interested in
doing the same science. You can get a bunch of people in a room together to talk about the problem they
are interested in solving. So LSST has both their interest in the shared challenge and there’s also, like
flies to the honeypot, you know what the phrase is, this core shared data product they are all
going to have to use; they are all going to start with the same data products.
So I think there is an opportunity, there are these hundreds and hundreds of researchers who have shaped
the design, but they actually have opportunity to do more than that. They are all going to do basically the
same science, or very similar science. They all have their own interests, but they have got an
opportunity with LSST I think to do something interesting. I think for me this kind of argument applies to
any large centralized science project. They should basically just copy open source. They should copy what
open source communities do in terms of how they build like the core parts of their software. And here’s
why: in open source there is this kind of ubiquitous culture of reuse. It is like people spend time reading
other people’s code, copying each other’s code, having an opinion on implementation and maybe making
their own implementation, but people are using each other’s work.
And I think the best example I have of with like my recent experience with this is a friend of mine who is
also an astronomer is currently riding a bike across America, which is quite impressive I think. He is
following the route of somebody who did this in 1884 on a penny-farthing, which is even more impressive,
and he is somewhere in Nevada right now. Actually I think he is in Nebraska now, but this guy Stuart Low
is carrying something not dissimilar looking to this. He has got a non-smart phone and he has got quite a
nice GPS that gives him latitude and longitude and we had a conversation when he was in San Francisco
and he said, “I am getting on this bike and I really want to have a map that shows where I am, I want to
update a map.” And I said, “Well have you got a smart phone?” He said, “No I haven’t got a smart
phone.” I said, “Well what have you got?” He said, “I have got an old Nokia phone and a Garmin GPS.”
And it was like all right, I am sure we can do something like this. And he had already actually, to be fair:
there is a file format called GeoJSON, so it’s a geospatially aware JSON structure that we render quite
nicely on GitHub if you upload it. So he had already worked out his path and he said, “It would be really
cool to update a map.”
So I was like, “Ah, we must be able to do this.” So I went to RubyGems, I typically write Ruby if I am
trying to get something done, searched for SMS and something like 50 lines of Ruby later the story here is
not that I am a good software developer, the story is that this was not hard. 50 lines of Ruby later we have
got a little service where when he’s done at the end of the evening he sends a text message with latitude and
longitude and it updates the map. So it’s just a deliberately small example, but to achieve that with
software you have got many components in what you build, like many components in your software
stack. But the point for me about that story is that my kind of 50 lines of code were about the thing that’s
different. Like the value add, the thing, the particular thing I want to achieve. I have no idea how to receive
a text message as a web service, but somebody else had built me that library already. I had a particular use
case; my use case was I had a GitHub repository with a map in it and I have somebody who wants to send
messages and I want to put those two together. That was the thing I was trying to build and I think this
should be the same in research as well.
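The kernel of that "50 lines of Ruby" could be sketched like this. The actual SMS-receiving plumbing came from an off-the-shelf gem, so it is not reproduced here; this sketch only shows the value-add part, turning an incoming message body like "41.25, -95.93" into a GeoJSON point appended to the track. The function name and message format are assumptions for illustration.

```ruby
require "json"
require "time"

# Parse an SMS body of the form "lat,lon" and append it as a GeoJSON
# Feature to a FeatureCollection (the map file in the repository).
def append_position(geojson, sms_body)
  lat, lon = sms_body.split(",").map { |s| Float(s.strip) }
  geojson["features"] << {
    "type"     => "Feature",
    "geometry" => {
      "type"        => "Point",
      "coordinates" => [lon, lat]  # note: GeoJSON order is [longitude, latitude]
    },
    "properties" => { "received_at" => Time.now.utc.iso8601 }
  }
  geojson
end

# Start with an empty track and record one evening's position.
track = { "type" => "FeatureCollection", "features" => [] }
append_position(track, "41.25, -95.93")
puts JSON.generate(track)
```

Everything else, receiving the text message as a web service, committing the updated file, was somebody else's library.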
>>: [inaudible]. So I think the issue with science, especially astronomy, is big data. So like Jim [inaudible]
who worked on this for years was fond of saying that sneaker net was still doing better than internet at
delivering astronomy data for really big data. So I mean fundamentally sharing the source code is fine, but
really when you are collecting this data a lot of what the scientists want to do is share the data so they can
run their different experiments. So same data different programs, of course sharing the programs is great,
but fundamentally they need to share the data. So how do you sort of see interacting with the whole big
data movement where every science is generating tons and tons of data where whether you are a biologist
or astronomer it’s easier to store.
>> Arfon Smith: Yeah, so Git does not help you in any way with data. I mean Git’s not an appropriate
technology for that.
>>: Yeah, but how do you integrate? I mean because a program when you execute is the source code plus
some inputs so having, I understand outside of GitHub, but to make this really work for science it seems
like you need [inaudible].
>> Arfon Smith: Yeah, so there are a bunch of people who are working on solutions that do this and I
would say there’s a collection of implementations that have varying likelihoods of success, in
my opinion. And they are like research workflows that are executable, that try and combine compute
with data with some kind of provenance capture of the software you are executing. The kind of
current breed does something pretty smart where they insist on referenced and accessible data stores. So you
have got: Where are you? I am in my compute environment. Okay, well here’s some abstraction of what it
looks like to run a work flow. But, where do those resources come from? And the resources can be
referenced as a binary at a URL or a Git repository.
So basically people are programming solutions together typically where they are referencing some of the
software on GitHub and the data might be local or it might be an API that they are querying. I mean that’s
sort of, I mean in LSST the data sets are projected to be, I mean I know Moore’s Law and it’s not wise to
argue against that, but I don’t foresee petabytes of data on your laptop in 6 years’ time. I mean I think we
can work that out. So people aren’t going to do this locally. So LSST has already kind of got a model
where there is going to be a few centers, and I think one is at NCSA, where they are going to have a large
probably virtualized compute environment where you can run your compute. So the opportunity here is
not, I mean so I agree that data and [indiscernible] are not the problem --.
>>: I understand Git has already made a separation. There are other web services that provide build, like
you don’t provide build as a service for example. So you have never been in the business of providing
other tooling other than sort of your value add for collaboration.
>> Arfon Smith: Yeah.
>>: Now to sort of pile onto that, astronomy is an interesting example to pick here, because with the
[inaudible] and the LSST the notion that you need to devote not an insignificant amount of the funding of
the project to figure out how you distribute the data is unique. It does not happen in other domains. And
Jim Gray was talking about this a long time ago and it’s now getting to the point where that notion, or the
expectation of the funding, that you will share that data. And that solves how you pay for it, where it goes
and how’s that sustained, but that is sort of coming to fruition and it’s now leading to the “Okay, then
what”? What happens when we get to an environment where everybody is sharing the data? What do we
do with it when there is an environment for sharing code?
>>: Well we know what they’ll want. What they will want eventually is to say, “Stop writing your
software from scratch, the data exists, here’s the framework for which you develop your algorithms and
stop translating from 17 different formats and spending your grad student’s time for 5 years, leverage your
framework.”
>> Arfon Smith: And this for me is the opportunity, where not everyone is producing their own data and it’s
not feasible to host your own copy of the data. You are not going to build your analysis tools from scratch
and you are probably not going to have your own copy of the data. But, I don’t know if you are thinking of
standard data reduction, 95 percent of the operations that you perform on those data are not actually
uniquely your contribution. They are just like the leg work, some massaging of the data, probably quite
primitive analysis before you do the bit that’s the actual contribution. So this is where I think research has
an opportunity to learn from open source. The software that you are writing as an academic should not be
the whole pipeline. The data reduction pipeline should be there and your value add, the academic
contribution, should be the 5 percent on the top. But, that’s not where we are today.
So if that’s somewhere we want to get to, like not many people are doing it today, what are the significant
barriers to actually making that behavior both possible and in fact normal? So I think the first one is credit.
As an academic today you would write a paper and the mechanisms for rewarding that, i.e. tenure,
professorship, are understood. But this [indiscernible] is just down the road and [indiscernible] was a launch,
I was going to say party, but I don’t think it was a party. It was an event for the co-funded program
between the Moore and the Sloan foundations, built around kind of the assertion that data science is the
bridge between lots of both computationally and data rich domains where methods are kind of key to the
domain. And the fact is that the environment that we exist in today in universities, the culture that we have, is
that people who build tools, people who write software, don’t actually get academic or equitable credit for
that kind of behavior.
So when you are looking at something like this tool that’s come out of the [indiscernible] lab, they go to
enormous lengths to point out that you should be citing them for this work. It’s in capitals because I guess
they are shouting at us, but there is a full citation which basically says that this is a piece of research
software. This is a DOI, this is how you can reference it, you should cite this, and this is scientific
software. If we go back to that Astropy project I was showing the contribution graph from before, this is
how, 75 contributors, this is a project that’s being built by fairly early career post docs. They are a pretty
healthy GitHub project; they have got nearly 200 forks and a couple of hundred stars. They actually wrote
a paper about this project. If you look there is a badge, I think this is the Python Package Index, that will
give you like this flair badge; you know it’s had 32,000 downloads in the last month, this piece of software.
It is basically used by pretty much all astronomers worldwide and yet the paper about this project has had
28 citations, so there is something going wrong here.
But, this is the only real established way to collect career points right now, to write a paper about the
code. But, David [inaudible] nailed this when he said that publishing a paper about code is just advertising.
The actual scholarship in scientific software is really often in the source code, not in the PDF that you
choose to share. So you know these are significant numbers, these are significant measures of reuse:
forking, starring, the number of people watching something. There are a number of people who are
pushing on this, but right now there isn’t an established way, recognized by tenure committees, of giving
a value metric to these kinds of open contributions that we see here. So I encourage you to
look at some of the [inaudible] metrics stuff if you are interested.
So this feels very appropriate to be putting this slide up here, but I think as research becomes more data
intensive, as we are moving into this domain of methods, the tools that we build are becoming more key
to actually doing any research. There is a great talk by a lady called Victoria [inaudible] who is at
Stanford and she talks about this crisis in reproducibility: how, as more and more of our
scholarly activities are encoded in software and this isn’t actually published typically, we are becoming
less and less reproducible as a domain. This is my XKCD summary of her talk: as we get more data
intensive we get less reproducible. I am nervous showing her this; hopefully she wouldn’t be too offended,
it’s an excellent talk.
So what are barriers? I think trust is a barrier. I think people are both scared of sharing, but also like: How
do I trust somebody’s software? Like when I went and downloaded that Ruby library to do SMS parsing,
it’s called [indiscernible], like something about this number made me think it was probably okay.
You know 600,000 downloads, I mean I assume that’s got some value, that number, but deeper than that I
can actually go and look at the code obviously. If I really care I can go and read the source code, I can go
and get a code climate measure, I can see that they have at least written some tests, I can go and see those
tests executing and I can gain confidence with that and I can actually go and see what code climate says
about churn and quality of what they are writing. So there are tools and ways that we can do that and I
think --.
>>: [inaudible].
>> Arfon Smith: Yeah, so these are all third party integrations here.
>>: So can you just say a little more about the code for the next one.
>> Arfon Smith: For Code Climate?
>>: Yeah, what they are offering.
>> Arfon Smith: So this is --.
>>: What they are offering.
>> Arfon Smith: Oh, they look at --.
>>: So they build tests.
>> Arfon Smith: They don’t do CI so this is Travis CI.
>>: Yeah, Travis I know.
>> Arfon Smith: So Travis does the CI and Code Climate does like method length, kind of duplication, and I
forget, but I think it looks at churn as well. But, it tries to give you a score. So this is nearly green. I think
4 is --.
>>: Okay, so more static code complexity.
>> Arfon Smith: So these are all things that can give me confidence. But, there is also a serious problem
with academic software and more generally sort of niche domains, where you are lucky if you are working
in a widely adopted field where there is a standard way of publishing. If you write a RubyGem, which is
kind of the de facto way of packaging up some functionality in a library, you put it on RubyGems. This is
where you find these things. I mean most people who are shipping RubyGems are publishing them
here. So there is a kind of de facto place to go and get them, but discoverability is hard I think in
academic domains. So knowing that somebody has even written software similar to yours is a serious
problem.
But, in truth I think most of the barriers that we have are cultural and not technical. They are about like
how academia works, not that there aren’t tools to service the needs of academia. So for academic
audiences I kind of encourage them to think about what they can do today and I think there is a kind of
wisdom coming from people that we should just stop talking about open science and just start doing it and
see where it breaks, where it works and where it doesn’t work. So when people are sharing I kind of
encourage them to think about why they are sharing. Is it because they want credit? Are they sharing
because they are trying to create an open source community? They are trying to get collaboration
happening online. I think this is a really important question. But, I think that the general message for
research communities is we want to try sharing a bit more often, but if you are doing that then certainly put
a license on that because that matters. So we give a lot of prominence to this on GitHub.
And also that documentation matters, because the sharing story I want to tell in the future is not this. It is
that there is like this step first, which is some Git clone of the latest version of that bad pixel mask, because
that’s the kind of thing that’s still with me a decade after writing that thing. So when it comes to
telescopes like LSST I think the open data argument is the absolute least we can do. There is so much
more in terms of opportunity. And this is not about open science; this is actually about accelerating how
science happens. So that is me, thanks.
[clapping]
>>: So going back to the [inaudible]. Do you have any metrics on how well reused the various components
in the framework are? I remember working on that kind of ten years ago, but I believe the
[indiscernible] has shifted significantly in ten years.
>> Arfon Smith: So in terms of actual reuse or mandating sharing?
>>: People’s willingness to share from the bottom up as opposed to in compliance with the mandate.
>> Arfon Smith: Yeah, my understanding, so three years ago I was writing grants to the NSF and there was
this data management plan that you had to write and you still do.
>>: [inaudible].
>> Arfon Smith: And yet I think that my understanding is that only in the last six months has anybody not
got a grant because of what they wrote in their data management plan. I mean you can write I have got a shit ton of software,
pardon my French, and then I am going to burn it all at the end. Like I am not going to share it with anyone
and that was okay.
>>: That was okay?
>> Arfon Smith: That was okay; you just had to say what you were going to do. So that was NSF and I
still believe you can still say this is going to be a closed source project because we want to write a ton of
papers and we are not in competition. I think people would be like, “Really, is that what, are you really
going to do that?” But it certainly is better to say this will be open source and open data.
>>: [inaudible].
>> Arfon Smith: NIH, my understanding, is quite a long way ahead of that. I mean I think it’s still more of
a mandate. I was at an NIH event a few weeks ago and Phil [indiscernible] --.
>>: Director of Data Science.
>> Arfon Smith: Yeah, director of data science is there to effect cultural change. And his joke is, well, what
do I do in my second week, but I think that’s a serious challenge. But, yeah, I don’t know. That’s a very long
winded way of saying I don’t know, but I know that definitely in the US there is a difference at the
agency level as to whether they are just asking for you to say what you are going to do or you have to
adhere to this policy.
>>: [inaudible].
>> Arfon Smith: I mean the White House said you had to, right.
>>: And there is legislation in committee this week to put that into effect. The expectation is still turning
now to you will make your data available, unless --.
>> Arfon Smith: Unless you have got some --.
>>: Human subject’s data.
>> Arfon Smith: Right.
>>: What is unclear is how the agencies comply with that when it’s a practically non-funded mandate.
>> Arfon Smith: Right, make me share but --.
>>: Who’s funding this and how is it being sustained?
>> Arfon Smith: Right, this is an opportunity right for people with cloud platforms and those kinds of
things.
>>: Can you talk briefly, getting back to what you were talking about with getting credit for code, talk
briefly about the project you did with Mozilla and how that’s going right now.
>> Arfon Smith: Oh sure, so I was going to show you something but, I don’t think that’s a very good idea, I
was going to go to a web browser. So right now there is a growing movement to make more things called
research objects, so things in a researcher’s portfolio that you might want to cite. So the idea being that you
might want to cite a data set or you might want to cite a piece of software or maybe even just a figure from
somebody else’s work or something. So there are some services, one called [indiscernible], there’s Dryad,
there’s [indiscernible], there’s a bunch of organizations that are taking on the burden of hosting data
and giving some kind of guarantee of its preservation.
And they are issuing things called DOIs, Digital Object Identifiers, and these are basically a bit like a
URL, but not HTTP. They are a referenceable string that you can use; you can cite and use these DOIs to
reference something, maybe it’s a paper, maybe it’s a piece of code. So just over the last couple of months
I have been working a little bit with the Mozilla Science Lab, [indiscernible] and [indiscernible], and we
basically made it really easy to get a DOI for a GitHub repository. So, it both archives the repository and,
just gluing together a couple of APIs, so there has actually not been much code written on the GitHub side,
but just making sure that those groups were kind of walked through how to integrate. And yeah, I think
somebody was saying there are about 2,000 pieces of software that have had a DOI issued now, which is pretty
neat.
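Once a repository has a DOI, citing it looks much like citing a paper. A hypothetical BibTeX entry for an archived piece of software might look like the following; the author, title, and DOI here are all made-up placeholders, not a real record:

```bibtex
@misc{smith_example_tool_2014,
  author    = {Smith, Jane},
  title     = {example-tool: a data reduction pipeline},
  year      = {2014},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.00000},
  note      = {Software archive of the GitHub repository}
}
```

The point is that the DOI gives you a stable, resolvable identifier for the software itself, independent of any one person's web space.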
>>: [inaudible].
>> Arfon Smith: And I hear they are all credible pieces of research software that you would actually really
want to cite. I mean right now a good example for me would be in cosmology there is a piece of software
called Gadget which is used for making your own universe in your computer. And this thing is on some
guy’s website. You download a tarball and he says, “Please cite me like this and use this URL”, but if he
leaves that university and that web space goes away a whole bunch of people are just not going to be able
to --. There is a preservation and longevity issue with lots of research software. So yeah, apparently the
things that people are archiving and getting DOIs for are really significant contributions, so this is cool.
There is a broader problem that some journals don’t let you reference software, which sucks, it has to go
in the appendix, but we can beat the journals up about that and many journals do now.
>> Alex Wade: Any more questions? All right, well thank you very much.
>> Arfon Smith: Cool, thanks for hosting me.
[clapping]