Open Science on “Future Tense”

Posted by Dan on February 5, 2010 at 3:30 pm | Categories: Uncategorized | No Comments

Yesterday’s “Future Tense” radio program on Australian Broadcasting was just posted online. The topic was Open Science, and I managed to get interviewed for the show. The interview with Anthony Funnell was a great conversation, and he’s pulled out some of the better bits while making the Open Science movement sound only slightly utopian.

If you’re going to do good science, release the computer code too

Posted by Dan on February 5, 2010 at 3:02 pm | Categories: Science, Software, open science | No Comments

A very nice article by Darrel Ince has just been posted over at the Guardian. It deals with the climate-gate email theft and the quality of academic science code. An excerpt:

Computer code is also at the heart of a scientific issue. One of the key features of science is deniability: if you erect a theory and someone produces evidence that it is wrong, then it falls. This is how science works: by openness, by publishing minute details of an experiment, some mathematical equations or a simulation; by doing this you embrace deniability. This does not seem to have happened in climate research. Many researchers have refused to release their computer programs — even though they are still in existence and not subject to commercial agreements. An example is Professor Mann’s initial refusal to give up the code that was used to construct the 1999 “hockey stick” model that demonstrated that human-made global warming is a unique artefact of the last few decades. (He did finally release it in 2005.)

Kitware has a blog!

Posted by Dan on January 29, 2010 at 10:59 am | Categories: Software, open science | No Comments

Geoff Hutchinson just pointed us to the new blog over at Kitware (the makers of VTK).  I’ve found VTK enormously helpful in the past (particularly the source to vtkMath.cxx) and I’m glad they’ve made the commitment to Open Source.

My favorite post so far: Why Open Source Will Rule Scientific Computing by Will Schroeder.

Being Scientific: Falsifiability, Verifiability, Empirical Tests, and Reproducibility

Posted by Dan on December 1, 2009 at 4:42 pm | Categories: Open Data, Science, open science | 1 Comment

If you ask a scientist what makes a good experiment, you’ll get very specific answers about reproducibility and controls and methods of teasing out causal relationships between variables and observables. If human observations are involved, you may get detailed descriptions of blind and double-blind experimental designs. In contrast, if you ask the very same scientists what makes a theory or explanation scientific, you’ll often get a vague statement about falsifiability. Scientists are usually very good at designing experiments to test theories. We invent theoretical entities and explanations all the time, but very rarely are they stated in ways that are falsifiable. It is also quite rare for anything in science to be stated in the form of a deductive argument. Experiments often aren’t done to falsify theories, but to provide the weight of repeated and varied observations in support of those same theories. Sometimes we’ll even use the words verify or confirm when talking about the results of an experiment. What’s going on? Is falsifiability the standard? Or something else?

The difference between falsifiability and verifiability in science deserves a bit of elaboration. It is not always obvious (even to scientists) what principles they are using to evaluate scientific theories,[1] so we’ll start a discussion of this difference by thinking about Popper’s asymmetry.[2] Consider a scientific theory (T) that predicts an observation (O). There are two ways we could approach adding the weight of experiment to a particular theory. We could attempt to falsify or verify the theory. Only one of these approaches (falsification) is deductively valid:

Falsification            Verification

If T, then O             If T, then O
Not-O                    O
------------------       -------------------
Not-T                    T

Deductively Valid        Deductively Invalid

Popper concluded that it is impossible to know that a theory is true based on observations (O); science can tell us only that the theory is false (or that it has yet to be refuted). He concluded that meaningful scientific statements are falsifiable.
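The asymmetry can be checked mechanically. Here is a minimal sketch in Python (the helper names `is_valid` and `implies` are my own, chosen for illustration) that enumerates every truth assignment for T and O and looks for a case where all the premises hold but the conclusion fails:

```python
from itertools import product


def is_valid(premises, conclusion, n_vars):
    """Return True if no truth assignment makes every premise true
    while the conclusion is false (i.e. the form is deductively valid)."""
    for values in product([True, False], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # found a counterexample
    return True


def implies(a, b):
    """Material conditional: 'if a, then b'."""
    return (not a) or b


# Falsification: If T then O; Not-O; therefore Not-T.
falsification = is_valid(
    premises=[lambda t, o: implies(t, o), lambda t, o: not o],
    conclusion=lambda t, o: not t,
    n_vars=2,
)

# Verification: If T then O; O; therefore T.
verification = is_valid(
    premises=[lambda t, o: implies(t, o), lambda t, o: o],
    conclusion=lambda t, o: t,
    n_vars=2,
)

print("falsification valid:", falsification)  # True
print("verification valid:", verification)    # False
```

The verification form fails on the assignment where T is false and O is true: the prediction comes out right even though the theory is wrong.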

A more realistic picture of scientific theories isn’t this simple. We often base our theories on a set of auxiliary assumptions which we take as postulates for our theories. For example, a theory for liquid dynamics might depend on the whole of classical mechanics being taken as a postulate, or a theory of viral genetics might depend on the Hardy-Weinberg equilibrium. In these cases, classical mechanics (or the Hardy-Weinberg equilibrium) is the auxiliary assumption for our specific theory.

These auxiliary assumptions can help show that science is often not a deductively valid exercise. The Quine-Duhem thesis[3] recovers the symmetry between falsification and verification when we take into account the role of the auxiliary assumptions (AA) of the theory (T):

Falsification              Verification

If (T and AA), then O      If (T and AA), then O
Not-O                      O
---------------------      ---------------------
Not-T                      T

Deductively Invalid        Deductively Invalid

That is, if the predicted observation (O) turns out to be false, we can deduce only that something is wrong with the conjunction, (T and AA); we cannot determine from the premises that it is T rather than AA that is false. In order to recover the asymmetry, we would need our assumptions (AA) to be independently verifiable:

Falsification              Verification

If (T and AA), then O      If (T and AA), then O
AA                         AA
Not-O                      O
---------------------      ---------------------
Not-T                      T

Deductively Valid          Deductively Invalid

Falsifying a theory requires that the auxiliary assumptions (AA) be demonstrably true. Auxiliary assumptions are often highly theoretical. Remember, they might be statements like “the entirety of classical mechanics is correct” or “the Hardy-Weinberg equilibrium is valid”! It is important to note that if we can’t verify AA, we will not be able to falsify T by using the valid argument above. Contrary to Popper, there really is no asymmetry between falsification and verification. If we cannot verify theoretical statements, then we cannot falsify them either.
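The role of the auxiliary assumptions can be checked the same way, by brute force over the eight truth assignments for T, AA, and O. This is only an illustrative sketch; the function names are hypothetical and chosen to mirror the argument forms discussed here:

```python
from itertools import product


def is_valid(premises, conclusion):
    """Valid if no assignment of (T, AA, O) makes all premises true
    and the conclusion false."""
    for t, aa, o in product([True, False], repeat=3):
        if all(p(t, aa, o) for p in premises) and not conclusion(t, aa, o):
            return False
    return True


def prediction(t, aa, o):
    """If (T and AA), then O."""
    return (not (t and aa)) or o


def aa_holds(t, aa, o):
    return aa


def o_holds(t, aa, o):
    return o


def not_o(t, aa, o):
    return not o


def t_holds(t, aa, o):
    return t


def not_t(t, aa, o):
    return not t


def not_conjunction(t, aa, o):
    """Not-(T and AA): something in the conjunction is wrong."""
    return not (t and aa)


results = {
    # A failed prediction alone does not single out T ...
    "Not-O, therefore Not-T": is_valid([prediction, not_o], not_t),
    # ... it only tells us the conjunction is false.
    "Not-O, therefore Not-(T and AA)": is_valid([prediction, not_o], not_conjunction),
    # With AA independently established, falsification of T goes through.
    "AA, Not-O, therefore Not-T": is_valid([prediction, aa_holds, not_o], not_t),
    # Verification stays invalid even when AA is established.
    "AA, O, therefore T": is_valid([prediction, aa_holds, o_holds], t_holds),
}

for argument, valid in results.items():
    print(f"{argument}: {'valid' if valid else 'invalid'}")
```

Only the second and third forms come out valid, which is the point: T can be falsified only once AA has been taken as true.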

Since verifying a theoretical statement is nearly impossible, and falsification often requires verification of assumptions, where does that leave scientific theories? What is required of a statement to make it scientific?

Carl Hempel came up with one of the more useful statements about the properties of scientific theories:[4] “The statements constituting a scientific explanation must be capable of empirical test.” And this statement about what exactly it means to be scientific brings us right back to things that scientists are very good at: experimentation and experimental design. If I propose a scientific explanation for a phenomenon, it should be possible to subject that theory to an empirical test or experiment. We should also have a reasonable expectation of universality of empirical tests. That is, multiple independent (skeptical) scientists should be able to subject these theories to similar tests in different locations, on different equipment, and at different times and get similar answers. Reproducibility of scientific experiments is therefore going to be required for universality.

So to answer some of the questions we might have about reproducibility:

  • Reproducible by whom? By independent (skeptical) scientists, working elsewhere, and on different equipment, not just by the original researcher.
  • Reproducible to what degree? This would depend on how closely that independent scientist can reproduce the controllable variables, but we should have a reasonable expectation of similar results under similar conditions.
  • Wouldn’t the expense of a particular apparatus make reproducibility very difficult? Good scientific experiments must be reproducible in both a conceptual and an operational sense.[5] If a scientist publishes the results of an experiment, there should be enough of the methodology published with the results that a similarly-equipped, independent, and skeptical scientist could reproduce the results of the experiment in their own lab.

Computational science and reproducibility

If theory and experiment are the two traditional legs of science, simulation is fast becoming the “third leg”. Modern science has come to rely on computer simulations, computational models, and computational analysis of very large data sets. These methods for doing science are all reproducible in principle. For very simple systems, and small data sets this is nearly the same as reproducible in practice. As systems become more complex and the data sets become large, calculations that are reproducible in principle are no longer reproducible in practice without public access to the code (or data). If a scientist makes a claim that a skeptic can only reproduce by spending three decades writing and debugging a complex computer program that exactly replicates the workings of a commercial code, the original claim is really only reproducible in principle. If we really want to allow skeptics to test our claims, we must allow them to see the workings of the computer code that was used. It is therefore imperative for skeptical scientific inquiry that software for simulating complex systems be available in source-code form and that real access to raw data be made available to skeptics.
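As a small illustration of what “reproducible in practice” can look like for a calculation, here is a hedged sketch in Python. The toy random walk, the function names, and the contents of the provenance record are all assumptions made for this example, not a description of any particular code; the idea is simply that the seed, the parameters, a fingerprint of the input data, and the interpreter version travel with the result, so a skeptic holding the same source and data can re-run it and compare.

```python
import hashlib
import json
import platform
import random


def run_simulation(seed, n_steps):
    """Toy stand-in for a real simulation: a seeded 1-D random walk."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the run
    position = 0
    for _ in range(n_steps):
        position += rng.choice([-1, 1])
    return position


def provenance_record(seed, n_steps, input_bytes, result):
    """Collect what a skeptic would need to repeat this run."""
    return {
        "seed": seed,
        "n_steps": n_steps,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "result": result,
    }


seed, n_steps = 12345, 1000
input_bytes = b"hypothetical raw data file contents"  # placeholder for real data

result = run_simulation(seed, n_steps)
record = provenance_record(seed, n_steps, input_bytes, result)
print(json.dumps(record, indent=2))

# An independent re-run with the same code, seed, and parameters
# should reproduce the published number.
rerun = run_simulation(seed, n_steps)
print("reproduced:", rerun == result)
```

None of this helps, of course, if `run_simulation` itself is locked inside a binary that the skeptic is not allowed to read.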

Our position on open source and open data in science was arrived at when an increasing number of papers began crossing our desks for review that could not be subjected to reproducibility tests in any meaningful way. Paper A might have used a commercial package that comes with a license that forbids people at university X from viewing the code![6] Paper B might use a code which requires parameter sets that are “trade secrets” and have never been published in the scientific literature. Our view is that it is not healthy for scientific papers to be supported by computations that cannot be reproduced except by a few employees at a commercial software developer. Should this kind of work even be considered Science? It may be research, and it may be important, but unless enough details of the experimental methodology are made available so that it can be subjected to true reproducibility tests by skeptics, it isn’t Science.


  1. This discussion closely follows a treatment of Popper’s asymmetry in: Sober, Elliott. Philosophy of Biology (Boulder: Westview Press, 2000), pp. 50-51.
  2. Popper, Karl R. The Logic of Scientific Discovery, 5th ed. (London: Hutchinson, 1959), pp. 40-41, 46.
  3. Gillies, Donald. “The Duhem Thesis and the Quine Thesis”, in Martin Curd and J.A. Cover, eds., Philosophy of Science: The Central Issues (New York: Norton, 1998), pp. 302-319.
  4. Hempel, C. Philosophy of Natural Science 49 (1966).
  5. Lett, James. Science, Reason and Anthropology: The Principles of Rational Inquiry (Oxford: Rowman & Littlefield, 1997), p. 47.
  6. See, for example, www.bannedbygaussian.org

On Reproducibility

Posted by Dan on November 23, 2009 at 9:02 am | Categories: Conferences, Open Data, Policy, Science, open science | 3 Comments

I just got back from a fascinating one-day workshop on “Data and Code Sharing in Computational Sciences” that was organized by Victoria Stodden of the Yale Information Society Project. The workshop had a wide-ranging collection of contributors including representatives of the computational and data-driven science communities (everything from Astronomy and Applied Math to Theoretical Chemistry and Bioinformatics), intellectual property lawyers, the publishing industry (Nature Publishing Group and Seed Media, but no society journals), foundations, funding agencies, and the open access community. The general recommendations of the workshop are going to be closely aligned with open science suggestions, as any meaningful definition of reproducibility requires public access to the code and data.

There were some fascinating debates at the workshop on foundational issues: What does reproducibility mean? How stringent a reproducibility test should be required of scientific work? Reproducible by whom? Should resolution of reproducibility problems be required for publication? What are good roles for journals and funding agencies in encouraging reproducible research? Can we agree on a set of reproducible science guidelines which we can encourage our colleagues and scientific communities to take up?

Each of the attendees was asked to prepare a thought piece on the subject, and I’ll be breaking mine down into a couple of single-topic posts in the next few days / weeks.

The topics are roughly:

  • Being Scientific: Falsifiability, Verifiability, Empirical Tests, and Reproducibility
  • Barriers to Computational Reproducibility
  • Data vs. Code vs. Papers (they aren’t the same)
  • Simple ideas to increase openness and reproducibility

Before I jump in with the first piece, I thought it would be helpful to jot down a minimal idea about science that most of us can agree on, which is “Scientific theories should be universal”. That is, multiple independent scientists should be able to subject these theories to similar tests in different locations, on different equipment, and at different times and get similar answers. Reproducibility of scientific observations is therefore going to be required for scientific universality. Once we agree on this, we can start to figure out what reproducibility really means.

Sad news about Warren DeLano

Posted by Dan on November 7, 2009 at 11:59 pm | Categories: Science, Software, open science | No Comments

I just heard the sad news about Warren DeLano, one of the giants of open source scientific software (and the author of PyMOL). Warren passed away suddenly a few days ago. Like everyone else, I’m stunned and saddened by this news. For those of you who don’t know, PyMOL is a fantastic piece of software that has produced some of the highest quality journal covers in the past decade. I met Warren only a few times. But over the years, we’ve had a few fantastic conversations about Jmol, PyMOL, and how to make open source sustainable in the scientific world. I’ll miss his voice and his contributions to the community.

Warren’s family has created an “In Memoriam” page and blog. Please share your memories there!

What, exactly, is Open Science?

Posted by Dan on July 28, 2009 at 11:45 am | Categories: Open Data, Policy, Science, open science | 26 Comments

I was recently asked to define what Open Science means. It would have been relatively easy to fall back on a litany of “Open Source, Open Data, Open Access, Open Notebook”, but these are just shorthand for four fundamental goals:

  • Transparency in experimental methodology, observation, and collection of data.
  • Public availability and reusability of scientific data.
  • Public accessibility and transparency of scientific communication.
  • Using web-based tools to facilitate scientific collaboration.

The idea I’ve been most involved with is the first one, since granting access to source code is really equivalent to publishing your methodology when the kind of science you do involves numerical experiments. I’m an extremist on this point, because without access to the source for the programs we use, we rely on faith in the coding abilities of other people to carry out our numerical experiments. In some extreme cases (i.e. when simulation codes or parameter files are proprietary or are hidden by their owners), numerical experimentation isn’t even science. A “secret” experimental design doesn’t give skeptics the ability to repeat (and hopefully verify) your experiment, and the same is true with numerical experiments. Science has to be “verifiable in practice” as well as “verifiable in principle”.

In general, we’re moving towards an era of greater transparency in all of these topics (methodology, data, communication, and collaboration). The problems we face in gaining widespread support for Open Science are really about incentives and sustainability. How can we design or modify the scientific reward systems to make these four activities the natural state of affairs for scientists? Right now, there are some clear disincentives to participating in these activities. Scientists are people, and we’re motivated by most of the same things as normal people:

  • Money, for ourselves, for our groups, and to support our science.
  • Reputation, which is usually (but not necessarily) measured by citations, h-indices, download counts, placement of students, etc.
  • Sufficient time, space, and resources to think and do our research (which is, in many ways, the most powerful motivator).

Right now, the incentive network that scientists work under seems to favor “closed” science. Scientific productivity is measured by the number of papers in traditional journals with high impact factors, and the importance of a scientist’s work is measured by citation count. Both of these measures help determine funding and promotions at most institutions, and doing open science is either neutral or damaging by these measures. Time spent cleaning up code for release, or setting up a microscopy image database, or writing a blog is time spent away from writing a proposal or paper. The “open” parts of doing science just aren’t part of the incentive structure.

Michael Faraday’s advice to his junior colleague, “Work. Finish. Publish.”, needs to be revised. It shouldn’t be enough to publish a paper anymore. If we want open science to flourish, we should raise our expectations to: “Work. Finish. Publish. Release.” That is, your research shouldn’t be considered complete until the data and meta-data are put up on the web for other people to use, until the code is documented and released, and until the comments start coming in to your blog post announcing the paper. If our general expectations of what it means to complete a project are raised to this level, the scientific community will start doing these activities as a matter of course.

If you meet a scientist who tells you that they did a fantastic experiment and have wonderful data, you naturally ask them to email you a reprint. Any working scientist would be perplexed if the response was: “Oh, I’m not going to be writing this work up for publication.” It would be absolute nonsense in the culture of science to not publish a report in a journal on the work you have done. And yet, no one seems surprised when scientists are too busy or too secretive to release their data to the community. We should be just as perplexed by this. Instead of complaining about the reward and incentive systems, we should be setting the standard higher: “What do you mean that you haven’t got around to putting your data on the web? You aren’t done yet!” Or: “How can I possibly review this paper if I can’t see the code they were using? There’s no way for me to tell if they did the calculation right.” We’re going to have to raise the expectations on completing a scientific project if we want to change the culture of science.

Saros: Distributed Pair Programming

Posted by Dan on June 26, 2009 at 4:11 pm | Categories: Science, Software | 1 Comment

[Image: the PairOn chair]

I’m a big fan of pair programming, which is one of the primary modes of software development in my research group. Usually, two people sitting together can spot errors that one alone can’t, and the pace of the coding and debugging is often much higher than two people sitting separately. I don’t know if my graduate students are as appreciative of this technique as I am. How many students want their advisor right next to them for the entire afternoon, taking over their keyboard, and seeing all the IM requests coming across the screen? But as a researcher, I find it gives me a much greater feel for what we’re actually doing in the lab, sort of like a small group meeting where we’re both looking at the same data or the same plot. It would be great if there were a way to separate the pair-programming from the “sitting in the same cramped cubicle” part of the equation.

Christopher Oezbek just let us know about a cool open source Eclipse plugin called Saros. This lets two people sitting in different locales collaboratively edit and work on the same project. I’ve seen similar things in the editor SubEthaEdit (which is not open source), but Saros will let two programmers do this at the project level (with multiple files open), not just at the file level. It looks like a very cool tool to avoid those overcrowded cubicles (or the famous PairOn chair pictured above).

Saros is listed in our Software Engineering and Tools sections.

Machine learning open source software

Posted by Dan on June 12, 2009 at 8:37 am | Categories: Software, open science | 3 Comments

Cheng Soon Ong just emailed me about mloss.org, a community creating a comprehensive open source machine learning environment. Mloss.org is essentially a community portal with lots of detailed information about each of the listed projects. One of the more interesting features of their site is that they’ve tied specific software to publication in an associated journal, the Journal of Machine Learning Research, to make it easy for users of the software to find and maintain a citation trail to the work of the original developers. The journal itself encourages open source submissions and automatically ties publication of papers related to the software to appearance at the portal.

This last bit is a very clever idea. Would a broader electronic journal (perhaps the Journal of Open Science) be a useful way to give open projects (Open Source, Open Data, Open Notebook) more citation currency?

Scientific Software Wants To Be Free

Posted by Dan on May 26, 2009 at 12:20 pm | Categories: Policy, Science, open science | 4 Comments

Go read this wonderful manifesto over at arXiv: Astronomical Software Wants To Be Free: A Manifesto by Weiner et al. The authors talk about some of the barriers to astronomical software development that are true in all scientific fields. The chief barrier they see is that there are no incentives (and there are some real disincentives) for authors to release software and documentation to other users. The recommendations are great (modified here only to include all scientific fields):

  • We should create an open central repository location at which authors can release software and documentation.
  • Software release should be an integral and funded part of projects.
  • Software release should become an integral part of the publication process.
  • The barriers to publication of methods and descriptive papers should be lower.
  • Programming, statistics and data analysis should be an integral part of the curriculum.
  • There should be more opportunities to fund grass-roots software projects of use to the wider community.
  • We should develop institutional support for science programs that attract and support talented scientists who generate software for public release.

The whole thing is a great read. Check it out!
