Making psychology Solid, Robust, perhaps even… Antifragile? Report from The Solid Psychology Symposium.
I went to another interesting symposium on how to make psychology (social psychology in particular) stronger and more reliable or valid, arranged by Radboud University Nijmegen and Tilburg University. I echo Alan Fiske – thanks, Stapel, for providing us with this wonderful opportunity! You can read a Storify of it here, and go read Rolf Zwaan’s write-up also.
First, some of the social side. I had a nice time hanging out with Daniel Lakens, and met and chatted with Rolf Zwaan and his wife. Afterwards, over beers (nice and warm, outdoors), I had good chats with both Leif Nelson and Travis Proulx, along with other speakers and participants.
But, back to the talks. First up, in the morning, was Uri Simonsohn, talking about how to think about what non-replication means; Leif Nelson, talking about p-hacking and the p-curve; and Joe Simmons, on power. What are the take-homes here? For one, engage with the results and what they mean.
Beginning with Uri Simonsohn: non-replication cannot just mean not significant (n.s.), because n.s. just doesn’t tell us much (remember Cumming’s dance of the p-values). Something can be non-significant because the study is underpowered. Did you then fail to replicate? Simonsohn suggested there is an incentive to underpower. I’m not sure. I’d think that would very much depend on what you think of the results you are trying to replicate in the first place. I can see an incentive towards underpowering new research where you have limited resources. But when replicating (unless you are holding an unseemly grudge, and by now we’d figure you out anyway), you’d want to give the phenomenon a fair chance, even if you are convinced it is a spurious result. Otherwise, why bother?
Uri used several examples to illustrate his points. The one that stuck out for me, as I actually teach it, is the weather–life-satisfaction result from Schwarz & Clore. For those not familiar (not everybody lives in my emotion bubble), S&C tested their “mood as information” theory by calling up people on rainy or sunny days (in the Chicago area) and asking them to rate their life satisfaction. Crossed with the weather conditions, in half of the calls the caller began by asking what the weather was like. The result: on rainy days, when the respondent had the weather called to his or her attention, life-satisfaction ratings were lower than in all other conditions. The thought here is that rainy weather makes you less happy, but if you are not paying attention to why your mood may be low, it spills over into other affective judgments – you use your mood as information. N (per cell) was 14. But a recent look at weather and life satisfaction in Australia, over a very long time span and with lots of people, really doesn’t show much of an effect of weather on life satisfaction. It was done rather differently, of course: they took advantage of a large existing set of life-satisfaction ratings and the record of weather conditions.
His suggestion, in the end, if we must use some rule of thumb, is to look at whether the effect size of the replication is smaller than a small effect size. But in some ways I see the bigger take-home as being that there are no really easy demarcations (ah, philosophy of science peeking up again). Look at the data and the results.
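One way to make such a rule concrete – and this is my sketch of the general idea, not necessarily the exact criterion Simonsohn proposed – is to ask what effect size the original study could have detected with some modest power (say 33%), and check whether the replication estimate falls below that benchmark. A rough calculation, using a normal approximation to the two-sided, two-sample t-test (the function name and the 33% choice are mine):

```python
from math import sqrt
from scipy.stats import norm

def smallest_detectable_d(n_per_group, power=1/3, alpha=0.05):
    """Effect size (Cohen's d) that a two-group study of this size would
    detect with the given power (normal approximation to the t-test)."""
    z_crit = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return (z_crit + z_power) / sqrt(n_per_group / 2)

# With n = 14 per cell (as in the Schwarz & Clore weather study), even the
# 33%-power benchmark effect is sizable:
print(round(smallest_detectable_d(14), 2))
```

A replication effect estimated well below that number suggests the original study could not really have detected the effect it reported, whatever the replication's p-value says.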
Leif Nelson’s p-curve talk got me wanting to try looking at the p-curves for some of the research that I am pursuing. Of course, p-hacking is what lots of us basically were trained to do, to handle noisy, disappointing, or plain weird data. But true effects and null effects have predictable p-distributions. Nulls have a uniform distribution. P-curves for true effects will be right-skewed – the bump is at the low end, near zero, with the tail reaching out to the right. P-hacked curves have the opposite, left skew, with p-values bunching up just under .05. You can look at the distributions and do a kind of rough meta-analysis this way – which he demonstrated, using choice as his topic. Are more options better, or do they impede selection? There is evidence for both. Meta-analyses actually don’t seem to say much, and he showed the effect-size tree. But looking at the p-values of the studies indicating “more is better” showed the non-p-hacked curve. Those indicating more is worse looked, actually, much flatter. Hmmm. I will look up and read their paper and use their tool (link above is to the post here where I linked it earlier. Here is his paper).
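The logic is easy to see in simulation. Here is a minimal sketch (function names and parameter choices are mine): simulate many two-group studies, keep only the significant p-values, and bin them. Under the null, the significant p-values are flat across (0, .05); under a true effect, they pile up near zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant_p_values(effect_size, n_per_group=20, n_studies=10_000):
    """Run many two-sample t-tests; return the p-values that came out < .05."""
    a = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
    b = rng.normal(effect_size, 1.0, size=(n_studies, n_per_group))
    p = stats.ttest_ind(b, a, axis=1).pvalue
    return p[p < 0.05]

null_p = significant_p_values(0.0)   # no true effect (d = 0)
true_p = significant_p_values(0.5)   # true effect of d = 0.5

# Bin the significant p-values into five bins of width .01 each.
bins = np.linspace(0, 0.05, 6)
print(np.histogram(null_p, bins)[0])  # roughly flat: uniform p-curve
print(np.histogram(true_p, bins)[0])  # heavy first bin: right-skewed p-curve
```

P-hacking would produce the mirror image of the true-effect curve, with counts bunching just under .05, which is exactly what the p-curve tool looks for.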
Joe Simmons talked more about power, and how to behave now that we no longer p-hack. First, he demonstrated what we can detect with the n per group that has been fairly standard (20). And, really, the effects that can be detected at that size are the trivially obvious ones. I’ve heard a summary of this talk before from one of the professors here: species-typical differences, like men being on average taller than women, you can detect with an n of 6. But things like liberals believing in egalitarianism more than conservatives – one of liberalism’s defining features, as he pointed out – take a much larger n to detect reliably. It is funny, but very illustrative, and makes it very clear how much you should distrust low-n studies of subtle and complex social-psychological phenomena.
(As an aside – and I have been thinking about this for a while, without having bothered/had time to check it out – the power they talked about here is for between-group comparisons. But a lot of work is done within subjects, with multiple presentations, which should increase power but introduces its own issue: stimuli affecting the response to the next stimulus. This has been demonstrated by researchers like Russell, looking at expression judgments both as clean versions (judging just one) and as contrasted away by prior expressions (Fehr and Russell); Brad Gibson, testing bottom-up attention by throwing in a new feature halfway through and seeing whether participants discovered it (nope); and Derryberry, looking at the immediate phasic emotion response to positive and negative feedback on the next stimulus. Power calculations need to consider this also – some kind of rule of thumb that is a bit better than what is going on now.)
Back to Joe. He suggests shooting for n = 50 per group as a minimum for these types of experiments, to reliably detect whether an effect is there. But he also brought up the question of what to do with data sets where the main hypothesis did not pan out. Just throw them out? No. Exploring in itself is not bad (as long as you are honest about it when reporting). He brought up an experiment that seemed like an anchoring experiment to me (estimating different quantities after having been given a high or a low anchor). It didn’t work out. But once you started looking at the data, you could begin to understand why things did not work out this time. The first item (estimating the weight of a lion) had enormous variation – a likely explanation being that people were clueless anyway, so the anchors could not produce a difference. The second, a jar of coins, included different denominations, and the instructions were ambiguous enough that perhaps some people estimated the worth of the money, and others the number of coins. The third item was better, and once some outliers were removed, looked significant. But we are now post p-hacking. What to do? Replicate with the item that seemed to work. Learn! Tinker.
(A later discussion ensued on whether to disclose the lion and coin-jar issues. They are interesting, I think, but Joe also made the important point that there is so much of this sketch-work and titrating that if all of it were reported, the literature would become even more impossible than it is right now. But I do think it would be interesting to have at least a partial record of problems and why they are problems – a kind of scientific out-takes reel, or museum of sketches. Both to consult (what has been tried, shown not to work, and why), and for training, showing the craft that goes into making research work.)
After the third talk came a short set of pitches for science-improvement projects. First up, Jelte Wicherts and his new journal Open Psychology Data. Second, Mark Brandt presented the Open Science Framework (which I have looked at, but not quite started using yet, despite good intentions – I had too many students. But perhaps I’ll do some uploading of data now that we are slowing down). Finally, Hans IJzerman talked about how they are working on storing data. This is really something we should pursue at my university also. For one thing, there is plenty of ERP data, and it is big – it needs good storage facilities. But also, we are becoming a much more productive department, and we need to keep good records of what we do.
This was an intense, but interesting morning. Lunch was welcome, and I was happy to meet Rolf Zwaan face to face.
Afternoon began with Philippe Schyns demonstrating data-driven techniques for doing research. I’ve read some of Philippe’s work, as he has done things on facial expressions, and I have looked at the papers produced by the work he presented. Their aim is to find the templates or representations that people have in their brains (I can hear Andrew & Sabrina protesting as I write this. But, hey, let a lot of methods bloom). To do so, they test individuals – not several individuals, single individuals – and they collect 30,000 datapoints from each. This resembles what was done in the Townsend lab I was affiliated with, especially my dissertation work with its 30,000 datapoints (spread across a few configurations) from 12 individuals, with the data analyzed by individual. The aim is to recover an individual’s face template, or letter template, or smile template.

For the face template, people were presented with random white noise (“the war of the ants”, as we call TV static in Sweden) and asked to say whether they saw a face in it or not (like detecting a face in a cloud, or the Virgin Mary in a piece of toast). Then, for each individual, one can take the stimuli deemed to contain a face and recover a kind of face template – what kind of stimulus would engage the face detector. If you look at the supposedly “face detected” stimuli, they just look random (perhaps because you are not that individual, but likely also because they are not very face-like). So this is a way of recovering a face template in a data-driven way. In subsequent ERP work, they see responses to the more “face-like” stimuli in the individual also. It is fascinating, and has its place. But the technique has the same drawback as my own work on gestalt perception of emotional faces: emotion detection is quick and ongoing, and the processing you can infer from data collected across 25 one-hour sessions is not likely to reflect this quick processing, but a strategic, learned processing.
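The reverse-correlation logic behind this can be sketched in a few lines. In this toy version (entirely my own construction – a simulated observer with a known internal template stands in for a real participant, and all names are mine), we show white noise, sort trials by the observer’s “face”/“no face” answers, and average: the difference between the two mean noise fields recovers something close to the hidden template.

```python
import numpy as np

rng = np.random.default_rng(1)
size = 16  # tiny "image" for the sketch

# Hidden template the simulated observer carries (a bright centre blob).
yy, xx = np.mgrid[0:size, 0:size]
template = np.exp(-((xx - size / 2) ** 2 + (yy - size / 2) ** 2) / 20.0)
template -= template.mean()

def observer_says_face(noise):
    """Noisy matched filter: 'face' when the noise correlates with the template."""
    return float(np.sum(noise * template)) + rng.normal(0, 1.0) > 0

# Many trials of pure white noise, as in the experiment.
face, no_face = [], []
for _ in range(20_000):
    noise = rng.normal(0, 1, (size, size))
    (face if observer_says_face(noise) else no_face).append(noise)

# Classification image: mean "face" noise minus mean "no face" noise.
ci = np.mean(face, axis=0) - np.mean(no_face, axis=0)

# The recovered image correlates strongly with the hidden template.
r = np.corrcoef(ci.ravel(), template.ravel())[0, 1]
print(f"correlation with hidden template: {r:.2f}")
```

Note the trial count: recovering a clean template takes tens of thousands of responses per individual, which is exactly why the real experiments need those 25 sessions – and why the inferred processing may be learned and strategic rather than quick and automatic.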
We were all very well aware of this, my adviser and my committee. These interesting processing patterns emerge in this huge data-set of reaction times – and it is a consistent pattern across my 12 subjects, and across the expression combinations that I used. But, does it say anything about how we orient real time to expressions? This really maps onto the Heisenberg uncertainty principle, which is about measurement. How to get to the gestalt/possibly strategic viewing in real time?
Oh, I enjoy listening to Klaus Fiedler. Really a wide-ranging intellect. The talk had similarities with his talk in February, but still differed in interesting ways. He quickly pointed out that statistics is just a tool, not THE thing in research – it is used to help us understand our results, not the be-all and end-all. He is very much a proponent of the theory-driven approach, rather than an effects approach. He also emphasized that we perhaps need to expand our theoretical ideas: evidence for our pet theory may be equally consistent with broader theories that encompass it, in which case our pet theory really is too narrow. His first example (which he also gave in Brussels, and which I believe is in his Perspectives paper): if you see a series like 2, 4, 8, 16, you may theorize that the rule is doubling the previous number (yes, it very well could be). But it could also simply be an increasing series of numbers. Or numbers. Or symbols. There is a danger of being too focal. A second (more psychological) example was the finding that faces associated with cheating are better remembered (cheater detection). But it could also be that it is faces associated with negative emotion. Or any emotion. This needs to be tested! He also thinks, as he has said before, that we need to make sure science can continue being creative. You may need to tighten research at times, but also expand it (go read Rolf Zwaan’s post on that). Take some risks. His concern is with validity, not reliability. Are we testing what we think we are testing? Are we just pursuing effects, instead of theories?
Alan Fiske is, in some ways, not in his element – as it should be, as he is an anthropologist. He began his talk suggesting that psychologists are perhaps starting at the wrong end of things, with all the testing and experimenting, and instead should perhaps spend more time observing and cataloging. Psychology spends so much time experimenting on the easiest-to-obtain participants – ourselves and our students, who are just about the strangest and most unusual humans on the planet ever (well, you know, the WEIRD thing). It is also the case that you become blind to your own culture, because that is just how things are. Fish in water, as the saying goes. This is, of course, a problem if you want to find universal mechanisms or a general understanding of the psyche. As he is in anthropology, where they routinely send their students off to cultures not their own, he has some command of human variation, and he proceeded to mention a lot of interesting variation when it comes to marriage, death, etc. He also questioned the idea that when we look at psychological phenomena, we are getting at some kind of general processing, with cultural differences as mere overlay. This may not be the case, perhaps depending on what you look at. For some processing there may very well be universals, but once you deal with social cognition, we are so attuned to the culture we grow up in and are surrounded by that we may very well be altered enough that what is found about processing in one culture may not hold in another. That is an empirical question, of course, but it suddenly makes it much more difficult to do proper psychology. And, yes, maybe we need to push through this. Research isn’t about the number of papers produced; it is about discoveries. He suggested the Human Relations Area Files for those of us who don’t have time to read ethnographies.
Travis Proulx, from Tilburg University, was the final speaker. And, I found out, he had left his university specifically to work with Stapel… Damn. And I took the cheating personally…. As his foils – his examples – he used Kuhn and the flashy social-psychology marketing guru Rick Starfield. Hmmm. Having lived so many years in LA, I found the latter completely plausible, but I think he may have been a brain-child of Travis’s. (Oh, you live in the land of the absurd, nothing seems abnormal anymore.) He started with the black queen of hearts – it shows up in Kuhn’s early chapters too – which you don’t notice, but which may leave you with a vague sense of discomfort that has consequences. Then he kept collecting and showing lots of different areas in social psychology where similar effects happen, but are all named different things and never put together. The point of his talk – which he cleverly disguised – is: let’s not do toothbrush science. Let’s work on normal science. Let’s put things together, rather than chasing new variants of phenomena, never considering that they, perhaps, are the same thing. Let’s go back to normal science.
The overviews are from memory (mostly written in the airport waiting for the flight home). BUT – there will be videos. The whole thing was filmed, and as soon as they are up, I’ll make sure to link them here also.
After, we went off for beers and chats, and I had dinner with Daniel Lakens.
These symposia for changing scientists’ practices have been great and needed. But there is a need to move to the next step. One part is the ground work for all of us: making sure we implement these new practices with our new students and with our colleagues. We also need to move the action to a different level of the system – that of journals, editors, granting agencies, hiring and promotion committees – because without altering how science is done there, it doesn’t matter what we do: those who go from fast and unreliable to slow and high-powered will be pushed out of the scientific gene pool. As one of the speakers said: good science takes time. Kahneman’s Nobel-winning work with Tversky rests on so few papers that they would not have gotten tenure today. We are supposed to discover new things. That means you will go wrong a lot of the time with your hypothesizing (Joe Simmons thought about 50% of his ideas were wrong). As I continue listening to and reading my Taleb (“Antifragile”), doing science is akin to being an entrepreneur or restaurateur. You take risks. Most of it will fail. Perhaps we need to protect those who valiantly fail, so that others can succeed, and not honor those who seem too good and perfect to be true, like the Stapels and Lehrers and Schöns of the world – because, really, nobody can be that good and productive and doing it right.