Why isn't "stereotype threat" stronger in the data?

From a recent survey by Pennington, Heim, Levy, and Larkin:

This systematic literature review appraises critically the mediating variables of stereotype threat. A bibliographic search was conducted across electronic databases between 1995 and 2015. The search identified 45 experiments from 38 articles and 17 unique proposed mediators that were categorized into affective/subjective (n = 6), cognitive (n = 7) and motivational mechanisms (n = 4). Empirical support was accrued for mediators such as anxiety, negative thinking, and mind-wandering, which are suggested to co-opt working memory resources under stereotype threat. Other research points to the assertion that stereotype threatened individuals may be motivated to disconfirm negative stereotypes, which can have a paradoxical effect of hampering performance. However, stereotype threat appears to affect diverse social groups in different ways, with no one mediator providing unequivocal empirical support. Underpinned by the multi-threat framework, the discussion postulates that different forms of stereotype threat may be mediated by distinct mechanisms.

Or from Wikipedia:

Whether the effect occurs at all has also been questioned, with researchers failing to replicate the finding. Flore and Wicherts concluded the reported effect is small, but also that the field is inflated by publication bias. They argue that, correcting for this, the most likely true effect size is near zero (see meta-analytic plot, highlighting both the restriction of large effect to low-powered studies, and the plot asymmetry which occurs when publication bias is active).

Earlier meta-analyses reached similar conclusions. For instance, Ganley et al. (2013) examined stereotype threat on mathematics test performance. They report a series of 3 studies, with a total sample of 931 students. These included both childhood and adolescent subjects and three activation methods, ranging from implicit to explicit. While they found some evidence of gender differences in math, these occurred regardless of stereotype threat. Importantly, they found “no evidence that the mathematics performance of school-age girls was impacted by stereotype threat”. In addition, they report that evidence for stereotype threat in children appears to be subject to publication bias. The literature may reflect selective publication of false-positive effects in underpowered studies, where large, well-controlled studies find smaller or non-significant effects.

Personally, I find stereotype threat to be a very intuitive idea with a fair amount of anecdotal support. So why aren’t these meta-results more convincing?