
Alert: New app for analysing p-values.

OOOOOOH, nice new shiny stats app: The p-checker

(Posted on shiny apps site no less).

I hold Felix Schönbrodt responsible (he tweeted it in).

I haven’t played around with it yet, but it must be shared, and must be put somewhere I might find it later.

I couldn’t help associating it with those lines on pregnancy tests, though. So now I want a line that indicates some kind of “yes”.

Looking for unpublished data for Creativity Meta-Analysis. Plz spread the word

I’m one of the supervisors here, and would like to spread the word (and maybe get some data). After all, doing more meta-analyses is likely part of fixing science!

Subject: Meta-Analysis: Call for Unpublished Data on the Relation between Creativity and Self-efficacy

We are conducting an exhaustive search of the published literature, and are now making a call to gather findings that are unpublished, or soon to be published. We are also interested in unpublished thesis data. We are especially interested in the zero-order correlations between ANY creativity measures (including self-rated creativity) and self-efficacy beliefs (general self-efficacy, as well as creative self-efficacy).

If you believe your study qualifies for inclusion, we are requesting details about the characteristics of the measurements, as well as of your sample, plus study design. The associated effect sizes are also desirable.
Alternatively, we would be happy if you can provide us with your data and any information required to determine how the variables might be coded.

We will only use the data for the purpose of the meta-analysis and we will delete the data afterward.
You can contribute your unpublished data via email to mailto:
Similarly, if you have any questions about this study, please do not hesitate to get in contact.
Thank you for the assistance and contribution to our work. We will gladly send you a copy of the meta-analysis once it is published.

Best regards,

Jennifer Haase

Master’s student at Lund University, Department of Psychology

Eva Hoff, Ph. D.

Lund University, Department of Psychology

Åse Innes-Ker

Lund University, Assoc. Prof. Psychology

Two critiques, and a faith restorer

I wanted to share links to some recent blog posts that I thought were interesting. The first is by Scott Atran (who researches terrorism), posting on Peter Turchin’s Social Evolution Forum. Scott recently had a commentary in Nature discussing how difficult it is to even get permission to study terrorism (in part, he claims, because ethics committees are set up to protect middle-class students). The post is an interesting discussion of research on humans, past and present (and much of psychology is, of course, research on humans). Scott Atran: Psychology, anthropology and a science of human beings.

The faith restorer is from Michael McCullough’s “Social Science Evolving” blog, where he discusses a p-curve exercise that he used in one of his classes. He had his students set up teams, and each team then selected a literature on which to do a p-curve analysis. For all 10 topics, the data showed evidentiary value! It sounds like such a good project for students to do, and it makes me feel a bit better on this day of numerate chickens.
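For readers who haven’t met p-curve: the idea is that if a set of significant results reflects a real effect, the significant p-values should pile up near zero (right skew) rather than sit just under .05. Below is a minimal sketch of the simplest version of that check, a binomial test on how many of the significant p-values fall below .025. The p-values in it are made up purely for illustration (they are not from McCullough’s class), the function name is hypothetical, and the full p-curve method uses more refined tests than this one.

```python
from math import comb

def pcurve_binomial(p_values, alpha=0.05):
    """Simplest p-curve check: among significant p-values, are more than
    half in the lower half of the significant range (p < alpha / 2)?"""
    significant = [p for p in p_values if p < alpha]
    n = len(significant)
    low = sum(1 for p in significant if p < alpha / 2)
    # One-sided binomial test: P(X >= low) when each significant p-value
    # is equally likely to land in either half (the "no effect" case).
    p_right_skew = sum(comb(n, k) for k in range(low, n + 1)) / 2 ** n
    return low, n, p_right_skew

# Hypothetical p-values, invented for this example only.
example = [0.001, 0.004, 0.012, 0.02, 0.03, 0.049, 0.008, 0.2]
low, n, p = pcurve_binomial(example)
print(f"{low} of {n} significant p-values are below .025; binomial p = {p:.3f}")
```

A small binomial p suggests right skew, which is what p-curve reads as evidentiary value. With only a handful of studies, as in this toy example, the test has very little to work with.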

I thought of that blog as a temperate response to this post on the Error Statistics Philosophy blog.

I think her critique is fair. But there is evidence in the less charismatic areas of social psychology.

2014 in review

The stats helper monkeys prepared a 2014 annual report for this blog.

Here’s an excerpt:

A San Francisco cable car holds 60 people. This blog was viewed about 2,000 times in 2014. If it were a cable car, it would take about 33 trips to carry that many people.

Click here to see the complete report.

Stapel’s derailment – Now in English thanks to Nick Brown.

Nick Brown (who blew up positive psychology’s metaphorical use of the Lorenz butterfly attractor as just so much nonsense) took it upon himself to translate Stapel’s autobiography. And he is making it available for free. Right here. Come on, download it. I know you want to.

I did.

On Trust, and the Process of Science

Some weeks ago, there were two tweet-streams that were about trust in science.

The first included Akira O’Connor’s successful campaign against a rejection based on a single review wherein he was accused of p-hacking. Evidently he is not alone when it comes to this experience. From being a high-trust endeavor, where you might have accused people of doing inane and misguided research, there is now suspicion that you are fudging research (but see Data Colada’s excellent tutorial on how to respond to suggestions of p-hacking).

The second was from Keith Laws, stating that pre-registration is not checking the sloppiness and the HARKing, as journals don’t always hold the researchers to their preregistration.


In short supply.

When I re-read David Hull’s “Science as a Process” this summer, I ran across his claim that scientists very rarely falsify results. That is not because scientists are a particularly virtuous group – he states very strongly that scientists are human, with all the foibles of ambition, self-serving biases and querulousness as well as the standard issue of nice traits, and that this doesn’t matter for science to work. The reason outright fraud was so rare is that it harms knowledge and ALL of the knowledge workers. As a scientist, you need to trust that what comes before works because, as important as reproducibility is, very few have the time to spend reproducing earlier results. We must trust results. They can be flawed, but they must be honest.

But, why was this enough? Well, his model of how the scientific process in the long run accumulates more knowledge, despite being done by flawed human beings, is one of replication and selection: An evolutionary process. Each scientist wants their ideas to spread, to replicate, to be selected, and one of the mechanisms for this is credit. I have a good idea, I test it and publish. You build on it, and give me credit for the good idea.

If I put out an idea based on faked results, my ideas will be selected against, rather swiftly, once found out. That is, you’re dead. Would any of you cite Stapel? Even his non-indicted papers? How about Marc Hauser? Do we really, really know about Förster? Would you cite without careful scrutiny?

At the time Hull was writing this (the book was published in 1988), science was, perhaps, smaller. His test-groups were two branches of classification scientists – those who work on finding ways to classify species of animals, plants, protists and the like. The two groups he followed seemed somewhat intimate, and entangled in discussion. The work was published in this one journal, where about 60% of papers sent in were published.* Many of the remaining 40% went unpublished because the authors never resubmitted. There was a great deal of scrutiny. A faker might very well be discovered early on, and would be out of the science pool.

Stealing, he claims, was tolerated (as in plagiarizing and appropriating other people’s ideas), because it only hurt the individual stolen from. Fraud hurts everybody.

The fact that fraud hurts a sizeable proportion of scientists and science is still true, of course (as does the less than robust science, which perhaps is behind the accusations of p-hacking, but not behind sloppiness with pre-registrations).

So what has happened, if anything?**

As I, and many others before me, have pointed out, science is now a huge enterprise which overproduces scientists. This makes the competition for slots to get to do science that much more fierce – in true evolutionary manner. Evolutionary processes filter for the fittest something, but whether this something also coincides with what humans consider good (in this case, increased true knowledge) is not guaranteed at all. Evolution, at its tritest, is: whatever survives survives.

Towards the end of his book, Hull asks a number of questions that are outstanding from his evolutionary model. One of them is – what happens if competition sharpens? Competition has always been a part of science, but Hull also spends a great deal of time demonstrating how important cooperation is for science to function well, and for science to produce more and more reliable knowledge. Citation is the minimum of cooperation – all of us need to rely on the work of other scientists in order to advance our ideas, and we need to acknowledge their work. But he goes further, demonstrating that you need cooperative allies – demes. You may not all agree, but usually there is some idea or concept that you agree upon, that you are all working on, and that you have a similar view on. This could be Darwinism or Cladistics (from his book). It could also be Social Priming, Persuasion, Emotion, what have you. There can be skirmishes, where one group – deme – marshals evidence for their idea against the ideas of another group (Categorical vs. Dimensional concepts of emotion; Cladistics vs. Phenetics; Darwinism vs. Idealism – the latter two from Hull). This arguing can be fruitful, and in itself advances science. Having allies is important. Hull demonstrates quite well that ideas that have only a single proponent, or proponents who cannot cooperate, don’t do well when it comes to the survival of that particular idea.

Hull also mentions, towards the end of the book, that career concerns (rarely mentioned, but of course mattering) tended to align with the more vocal concerns about getting the science as right as possible. Doing good science in a productive deme got you published and cited more, and could be transformed into better career opportunities and resources for continuing to drive the idea forward.

Perhaps it is here things have broken down, in the increased competitiveness – I think Shauna Gordon-McKeon’s “When science selects for fraud” lays this out very well. Career concerns are no longer as well aligned with good science. In fact, they can interfere with it, as has been discussed over and over again in various blogs. (Both Jelte Wicherts and Brian Nosek brought that up in the “beyond questionable science” symposium. Worth a second look here).

So, together, the sheer size, the lack of good demes and the competitiveness may have diluted how effectively the processes in science select against fraud and cheats.

Honest signals, and their faking.

If you look at game theory/evolutionary models on how trust can be maintained there must be some means for the cooperative individuals to protect against the untrustworthy (inspection), and some means to make it costlier to cheat (e.g. damaged reputation). I mused on a model based on Robert Frank’s emotion model in this blog post, but there is plenty of work looking at how to dis-incentivize cheating.

Concern about reputation (as gossip and reputation are a way to keep cheating in check) is one route towards maintaining trust. In science that would be having a reputation as a good, honest scientist. *** But reputation can be gamed. In my marketing psychology course, based on Cialdini’s “Influence”, we discuss how authority can be co-opted through, for example, clothing or titles. When the field is large and impersonal, as most scientific fields are now, the indicators may be very much removed from actual performance – indicators like the number of publications, in which journals, with how many citations. And here journals are also working on maintaining their reputation, perhaps by being known for flashy discoveries or high rejection rates, none of which necessarily correlates highly with increasing actual knowledge (as perhaps the high retraction rate from the glam magazines indicates. Lots of work has been done on this). Publications, journals, and citations are then not necessarily honest signals of high quality, but sometimes, like the king snake or the cuckoo, mimicry.

Routine inspection (peer-review) is somewhat costly, but should be a way of ferreting out at least some of the cheaters. But a surprising number of papers have been through peer-review without their problems being discovered. Perhaps, as Frank suggested, inspection got lax because scientists generally trusted that the other scientists were honest. The larger the share of honest cooperators, the less time needs to be devoted to inspection (time which then can be devoted to other, more productive activities).

When the fields are huge, there is not enough nearness to the agents in order to verify and inspect. What rises to the top may not be who does solid work, but who can project well – possibly a kind of narcissism.

I don’t know how to restore trust. But, the ease of establishing social connections via twitter and blogs may make it easier for us to share what doesn’t work, so we don’t end up like this poor bug (thanks to Felicia Felisberti who tweeted it in).

Efforts to do post-publication peer-review also allow more public scrutiny of results from scientists both friendly and unfriendly towards those ideas. (Friendliness is not a requirement. If you are against a theory you may be more likely to find its holes than if you love it. Hull points out that friendliness is not needed for science to go forward, as much as some of us would like it to be so). And, perhaps, they highlight how incredibly important cooperation and collaboration are. Competition has its points, but when you use that as the only gauge, you get the Lance Armstrong effect. One can argue about the goodness or badness of that in sports, which I tend to think of as trivial. It is not trivial when your ostensible goal is to increase our knowledge about the world.

*(There is a whole chapter analyzing who is accepting papers from which group, to specifically investigate if there were obvious biases against the opposite camp. Conclusion – not really.)

**I’m making the assumption that there is an increase in fraud. There certainly has been an increase in less than robust science. Feel free to contest.

***According to Hull there are a couple of other issues involved here, which have to do with whether one chooses to do solid but not very exciting research or risky research. Plodding puzzle solving is low risk, and a way of maintaining a solid reputation as trustworthy. Taking more risks could result in a very high reputation if the research pans out, but one risks taking a big hit to reputation if it doesn’t, or if it too frequently turns out that the exciting research is not robust. This is entirely under the assumption that both the plodding and the risky work are done honestly.

****I have adopted Simine Vazire’s footnotes.

On Null results, refined.

The other day, JP de Ruiter tweeted in:


He has a point.

And, well, we do not want to use the sleight of stats Keith Laws suggests.


Which, as this post that just precedes this one shows, I have been pondering before, and I’m far from the only one pondering this. (Hey, it is my blog. I get to repeat myself. I think I’m sketching….)

Unlike Animal Farm animals, not all studies with null results are created equal. All of us know the standard reason, passed down through the training generations, why null results are not published: there are many reasons why a study doesn’t work out, and a lot of them are scientifically entirely uninteresting. The uninteresting ranges from poorly thought-through methods, badly chosen stimuli, errors in timing and badly run studies to crappy conceptualization; like those unhappy families, each failing in its own way, though terribly uninteresting to write tomes about. This is what we remind our students of when they, with feeble hope, pipe up that it is really interesting to know what doesn’t work.

Sure. But the universe of “doesn’t work” is endless. Only things that don’t work in interesting ways are informative. Which, well, raises the question: what is an interesting way?

I know of two papers that published null results prior to the replication flurry. On one, my advisor was a co-author, along with June Tangney and others, on certain aspects of Higgins’s Self-Discrepancy theory. The second was work by Jari Hietanen, where he looked at whether the emotional expression of a centered face with eyes pointing to either direction in an attention paradigm (bear with me) mattered. That is, are we more likely to be lured by the eye-direction of a frightened face (as evidenced by faster reaction times when the target is in the gaze direction, and slower when the target is in the opposite direction) than by other emotional expressions? He didn’t find that in 5 different experiments, using different depictions of faces. Both papers involved multiple studies and multiple variants of stimuli and paradigms. Tangney et al.’s also included an alternative prediction. Lots of work. Perfectly reasonable. Rarely seen.

But, there are a lot of other types of null results.

Across the street from where I work, there is a museum called “skissernas museum” – the museum of sketches. It is filled with earlier drafts, sketches, and preliminary models of artwork that are officially displayed in museums, or as sculptures in squares, and in some cases well known.

A piece of art is not created from blank thoughts to the finished product in one go. Before it come the sketches, the attempts, the miniature models. Even I, in my feeble amateur painting, spent a bit of time sketching.

This is how I think about my spiders and snakes and attention (insert Oh My here) work, which has yet to see the light of day. We got something in each study, but could not interpret it. So, we kept tweaking them. Changing a thing here or a thing there. Alas, I left for Sweden before we had a tweak that gave us clear results.

A lot of the filedrawer may be just this kind of work. Sketches. Drafts. Preliminary work.

Some are more like our tweaking of a Stapel Ebbinghaus study (as far as I know based on genuine data) where instead of social categories we used emotional expression. The non-results of that one probably linger comfortably in that file-drawer, or landfill as is the case now (as I emptied the drawers out myself). We gave it a good try, it didn’t work, oh well, it was a bit of a long shot (although I have seen it done lately. Gasp).

Then there are those that may be informative in a different way. I think the five variants of testing whether emotional state influenced perceptual processing of emotion-congruent faces might have deserved a null-publish. We thought it might work, it didn’t, and we had some ideas why (and, also, as a warning: don’t waste your time doing this).

And then the even more troubling kind – when researchers have attempted to replicate fairly directly some interesting effect that has already been published, and did not get it.

Pre-registration takes care of some of that, but that is for fairly late in the game. Here things are well thought out, and one can make a full-blown hypothesis testing that may or may not work out, and people are willing to bet both time and money on setting it up. But, not all of the attempts are of that kind.

These last couple of types are the ones that are missing, and that would be informative for research.

But the rest? The sketches? And all those attempts that find no results for reasons that have nothing to do with what is tested, but everything to do with the performance? (One has to remember that we likely all make these kinds of mistakes along the way, where the problems with stimuli, with collection, with design, and with thinking things through are only evident in hindsight.) What to do with them? Not all are strategic cases where you run a lot of studies and publish what “worked”. They just didn’t.

Publish? As if the literature isn’t crowded enough as it is. Even Skissernas Museum limits itself to fairly late prototypes and sketches.

Paul Meehl suggested that it might be a good idea to have some place summarizing the pilot work that didn’t work out, in order for others to not go down that particular wrong turn. (Some turns are just so attractive that we may go down there multiple times, just to find it is a dead end).

For some areas that may be very interesting to formalize. But keeping it all may be like insisting on plastering every scribble of your kid’s daycare work on the wall.

Perhaps one of the issues also is that the criteria for publishing have been too lenient, or that the methods for determining what is real (aka null-hypothesis testing) are just too weak. Yes, I know, lots of people think that, and have said that for a long time! (I just re-read Meehl’s paper on Sir Karl and Sir Ronald, where he chides hypothesis testing for being much too light a challenge for a hypothesis. Put them to risk!).

*Yeah, I realize I covered this in my earlier post too. But it is my blog so I get to repeat myself if I want to. Perhaps I’m sketching.

Musings on what should be published

I just reviewed a paper that wasn’t stupid, and asked an important question. It is just that it was thin, and a null-result. It used 80 participants in 4 cells and it wasn’t repeated measures. They replicated (weakly) one finding, but found no effect for what most likely was what they really were going for.

I’m getting very sensitive to the file-drawer problem. If we have sensible data, should it languish? Yet there is a problem with cluttering up the journals with short, underpowered studies.

I left it up to the editor (who is my colleague) to reject it.

What I would have wanted to see was, first, better power. Then, follow-up work on the particular question.
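A rough sketch of what “better power” could mean for a design like that one. The assumptions here are not from the paper under review: that the key test is a simple comparison between two of the four cells (80 / 4 = 20 people each), that the true effect is a medium-sized d = 0.5, that alpha is .05 two-sided, and that a normal approximation to the t-test is close enough for a back-of-the-envelope number. The function names are made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_cell, alpha=0.05):
    """Approximate power of a two-sided comparison of two independent
    cells, using a normal approximation (slightly optimistic for small n)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_cell / 2)  # expected z-statistic under the effect
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

def n_per_cell_for_power(d, target=0.80, alpha=0.05):
    """Smallest per-cell n whose approximate power reaches the target."""
    n = 2
    while approx_power(d, n, alpha) < target:
        n += 1
    return n

# Assumed scenario: 80 participants in 4 between-subjects cells, d = 0.5.
print(f"Power with 20 per cell: {approx_power(0.5, 20):.2f}")
print(f"Per-cell n for 80% power: {n_per_cell_for_power(0.5)}")
```

Under those assumptions the chance of detecting the effect with 20 per cell comes out at roughly one in three, and reaching the conventional 80% would take a bit over 60 per cell. A smaller true effect, or an interaction rather than a two-cell contrast, would look worse still.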

But this makes me think about publishing policy. I really understand the desire to publish things that “work” (except that the indications of what works are so weak in psychology). It is like you want to unveil the final sculpture, the polished version of the violin concerto, the bug-free version of the software – not all the sketches and wrong steps and other discards on the way. You want to publish a real Finding – even if (as in all research) it is tentative.

But, the sketches, and wrong turns, and pilots, and honing have some kind of information. At least sometimes it is really important to know what doesn’t work. And, as was evident from the special issue on replication, there is work out there that people informally know does not work, but is not in the public record because the failure to replicate has not been published.

We had a brief discussion about this at last year’s “solid science” meeting. Joe Simmons said that there really are loads of pilots of ideas that turned out to be crap, and that they really don’t need to be cluttering up cyberspace and our ability to navigate information, whereas Jelte Wicherts thought it is really important to have a data record.

I’m very ambivalent. There is so much data collected – I’m thinking of a lot of final theses that are done – where the research is the equivalent of arts and crafts projects that show that you can do this, but don’t really add to the research record.

Or, all those pilots that you do to tweak your instruments and methods. What to do with those? Meehl, in his theory of science videos, suggested that you collect that info in short communications, just for the record.

I’m thinking of two file-drawers I have. One of them really demonstrates that the phenomenon we were testing doesn’t exist. It is a boundary condition. As such, it might have been important to have it out there (5 studies, 90 people in 3 conditions in each, repeated measures). I have another set of 9 studies looking at threat and attention which are more of the “tweak the paradigm” type. Something happened, but it was terribly messy to interpret, and thus we were working on finding an angle where results could be more clear and interpretable. How do you make that distinction?

I have some idea here that it would be nice if one could spend that time with the sketches. Once it works, one needs to replicate, and one only publishes when one feels fairly certain that there is something there (and possibly include links to the sketches). Which, of course, is not how it is done right now, because of the incentive structure.



Social Psychology Replication – Special Issue.

The first time I met Daniël Lakens, he and Brian Nosek were working on a special issue in Social Psychology, calling for registered replication of well-known, highly cited studies.

It is now out! 15 articles of attempts to replicate with, let us say, mixed results.

I’m linking in the PDF as they posted it on the OSF framework, so you get both the text, and more exposure to the framework for your future collaboration efforts!

Some people, Science reports, don’t like being replicated, at least when the results are different. I’m thinking, once things are out there in the record, work really is up for being replicated or questioned. I thought that was the point! Maybe, once this is done more regularly, people will adapt and won’t go all drama. Exposure therapy, I believe, has evidence on its side.

Chris Chambers, who has long been at the forefront of the call for registered reports (and implemented it at Cortex), has a more uniformly positive view of the practice here.

I have done a first skim-through, and clearly, clearly we need to put a lot more effort into replicating results, march slower, be careful with what we accept.





Signs of reliable and unreliable research, reason and persuasion, and an RIP.

I thought I had more cool posts to share, but I got so wrapped up in the Baumol disease I got discombobulated.

But, yes, plenty more good posts to share, so I’m sharing them now.

About a week ago, my bud Daniël Lakens reported on this Find on his blog. A paper even older than me! Yes, people have been thinking about these issues for a long time.

Sylvia McLain asks if Spotting Bad Science really is as easy as a nice poster giving instructions on how to do it. And, of course, if it really were, there wouldn’t be as much bad science. But as a handy-dandy tool it can be a useful starting point. The creator of the poster answers in the comments, and there is a good conversation.

Speaking of Bad Science, JP de Ruiter linked in a Brain Pickings article highlighting Carl Sagan’s baloney detection kit (got that?)

Tom Stafford linked in his draft of this very lovely article on rational argument. He brings up both Cialdini, and argument as a means of persuasion rather than correctness. As it is draft 2, it may evolve further, but I thought it was just great.

Last month, Keith Laws and others debated whether CBT for psychosis had been oversold. It was all filmed, so you can check it here (as I watched it at the same time as I was reading about montage and cutting techniques, I found myself wishing for some of those, plus a good sound engineer, but you can’t have everything). A Storify from Alex Langford appears here. I considered it a good example of how a good anecdote trumps good data as far as persuasion goes – which ties in with the Cialdini in Tom Stafford’s piece – but I’m not a clinician. Worth checking out though.

Last, I was very sad to hear that Seth Roberts died. I’ve followed his blog for a few years now, and I thought him very interesting, innovative and thoughtful (I even posted in his comments once, regarding all this Stapel fraud stuff, as he has been involved in that).


