Mining the rich veins of scientific text, and the barriers against it.
Yesterday I got all giddy about the Voyteks’ semi-automatic hypothesis generating tool. And then I found this Guardian article, which discusses proprietary problems with mining scientific texts. I think I got it from Ben Goldacre’s Twitter stream. (If you don’t follow him already, do: @bengoldacre).
The problem is getting access to journal materials, so you can use nifty computer tools to sift through text and data and give researchers a better overview of what is happening. Considering how much information is out there, this is necessary. Researchers need tools to sort through all the rich information that we have gathered in bits and pieces but have not put together, and probably cannot put together without help from powerful computer tools. The problem here is not one of technology but one of property. Researchers need permission to access the published texts, and that seems to be…not so easy.
Tal Yarkoni (in his blog Citation needed) has also talked about a similar issue with trying to mine data from published articles. This is not about who owns the rights, though, but about how the formatting is done, which makes it very difficult to harvest the data.
Finally (and relatedly), there are more places with proprietary information than the scientific journal texts. Lots of us freely spill all sorts of personal information for the world and its spider bots to see. Which is why we get such nice tailor-made ads thrown at us. It is also a great resource for doing wild and interesting research! But the data will not be shared, alas, so how can other researchers verify that the research is correct? Here is an NYT article outlining this.
I just read an article by Geoffrey Miller on using smartphones in research, and the issues with privacy, deidentification and all that. But I’m getting too sprawled for one post, and this will need a serious discussion of its own.