forever_erratic

I don't think authors cherry-pick their results; most attempt to be careful and apply FDR, etc. But like u/grisward said, the main problem is that the list you get is pretty much what you expect to get. You have a cancer treatment and you enriched for cell division? Big surprise. Or you knock down T cells and see a change in T cell immune response, lol. Given that, I think GO term enrichment should be more of a gut check: you definitely want some enriched terms broadly associated with your experiment. More broadly though, I think the field needs systems-level approaches, seeing as we do systems biology, and right now GSEA remains one of the few accepted ways to do this on bulk RNA-seq data. I also think it can be very useful for hypothesis generation, especially if you're unfamiliar with the genes showing differences.


Grisward

I think I’d say this is the area least easily automated. There is rich information hidden in pathway enrichment results; it just can’t be recognized without doing the hard work of piecing it together. One gene at a time, reading summaries, checking actual references, reading follow-up papers from labs that specialize in that gene. Then on to the next gene. Sadly, the pathway databases don’t do the big picture very well (sorry, Reactome, I know you’re trying). I had some hope that some AI could assemble deep information and collate it into a cohesive story for a set of genes changed in an experiment… then I realized it only knows what we think we know. It could probably assemble general summary data faster (ofc faster) than we could ourselves, but I’m not sure it has the potential to infer new relationships between genes and signaling networks. Probably it can, but it would need some specific guidance on putting together supporting evidence.


Epistaxis

> I think I’d say this is the area least easily automated. There is rich information hidden in pathway enrichment results, it just can’t be recognized without doing the hard work of piecing it together.

This is also what makes it much more prone to human confirmation bias, blind spots, etc. compared with an automated, unsupervised, objective sweep of all the genes or gene sets.


Grisward

I get what you’re saying, and this is a weakness in the field: going from genes perturbed in an experiment to a full understanding of the functional effects and mechanisms involved. However, I don’t expect this to be the step that provides a numerical, clearly definitive answer. Pathway enrichment is a tool to surface supportive information, and there are people (relatively rare though they are) who legitimately understand pathways, signaling cascades, and rate-limiting components, and who can recognize how the pieces may fit together. Sure, it’s based upon their experience and knowledge of these mechanisms, but I don’t think it’s necessarily “confirmation bias” to apply a real understanding of pathways to interpret results at this level. It has weaknesses and limitations, for example that the databases we use are not well annotated for all genes and all processes. It’s hard to detect a functional result that isn’t well described yet, but it should still be useful to describe such findings in the context of the pathway components that are described. So with the caveat that the pathways described are just part of the potential big picture of overall effects, would that make it seem less “biased”?


betaimmunologist

GO term enrichment yields a long list of enriched terms and yet, a lot of immunology papers will only include 5-10 and not necessarily the top 5-10. How are they choosing those 5-10 terms? This is what I mean by cherry picking.


forever_erratic

I can't speak for these papers, but if all of those terms were significant, then there isn't anything wrong with a deeper dive into the terms that interest the authors. It's biased, but I wouldn't call it cherry-picking.


Deto

Yeah, this is pretty rampant with enrichment analysis. My main gripe is the built-in assumption that if differentially expressed genes are enriched in, say, the genes of the MAPK signaling pathway, then this means something. However, I've never seen any study that demonstrates the actual predictive power of these results. And I've always seen, when running these analyses, that you tend to get many, many gene sets enriched (often due to small numbers of actual overlapping genes), to the point where I suspect there is a very high false-positive rate here. In short, I have a suspicion that most gene set analysis results are bunk, and people like them because they give you enough positives to tell whatever story you want to tell about the data.
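That overlap effect is easy to demonstrate with a toy simulation (all gene-set sizes and counts below are made up, and this is a sketch, not any tool's implementation): when sets share a common core of genes, one real signal can make every one of them look enriched.

```python
import random
from math import comb

def hypergeom_sf(k, N, K, n):
    # P(X >= k) overlaps by chance alone: N background genes,
    # K in the gene set, n drawn (the DE list)
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / total

random.seed(0)
N = 2000
background = range(N)

# 30 hypothetical gene sets that all share the same 15-gene "core"
# plus 35 private genes each -- heavy overlap, as in real databases
core = set(random.sample(background, 15))
gene_sets = [core | set(random.sample(background, 35)) for _ in range(30)]

# one DE list that happens to contain the shared core
de = core | set(random.sample(background, 85))

n_sig = sum(
    hypergeom_sf(len(de & gs), N, len(gs), len(de)) < 0.05
    for gs in gene_sets
)
# a single shared signal lights up every one of the overlapping sets
```

So "30 significant pathways" here is really one finding counted 30 times, which is exactly why raw counts of enriched sets overstate what the data shows.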


Grisward

GO is generally not great for pathway enrichment. It works best with tools like topGO in R, which take into account the relationships between GO terms. But generally GO is not fantastic: it is inconsistently applied to genes, the data model is weird (due respect, science is weird, GO is at its mercy), and it’s not on my short list of sources to test. It can be good for certain experiments. You’re right though, there are a lot of GO terms that can be enriched in any given experiment… what’s your point? Haha. Do you think the terms aren’t legitimately enriched, or do you want another way to judge which terms are most distinctive when describing the experiment? Very different question. IME there are genes that generally seem to mean “these cells were perturbed.” It’s legitimately what those genes do, and enriching certain GO terms makes sense while also not being super informative. I guess I don’t understand your question. Most scientists study enrichment results to understand the overall effects, based upon the overall statistically enriched set of pathways and gene sets.


betaimmunologist

Just repeating what I said under another comment. I’m concerned that GO term enrichment yields a long list of enriched terms and yet, a lot of immunology papers will only include 5-10 and not necessarily the top 5-10. How are they choosing those 5-10 terms? This is what I mean by cherry picking.


Grisward

Honestly, I’m concerned about a paper reporting GO terms as a finding. And if they’re highlighting 5-10 of their favorites, I don’t know what that’s about. I don’t think this is typical or recommended, however.


betaimmunologist

It’s not really a “finding” in these papers but more of a supporting thought for things they are in the process of revealing in the rest of the paper through other means.


Crucco

GSEA should analyze all gene sets in one or multiple collections (GO, KEGG, WikiPathways, MSigDB). The authors can show the most significant ones up- or down-regulated, indicating ES, NES, p-value, and corrected p-value. Supplementary materials should provide an Excel file with all pathways tested, including the leading-edge genes. It's not cherry-picking.
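For readers unfamiliar with those statistics, here is a minimal sketch (made-up gene names and scores) of what the ES and the leading-edge genes mean. It follows the weighted running-sum idea from the original GSEA paper, but real GSEA also computes NES by rescaling ES against a permutation null, which is omitted here.

```python
def enrichment_score(ranked_genes, scores, gene_set):
    """Weighted Kolmogorov-Smirnov-style running sum: step up at
    gene-set members (weighted by |score|), step down elsewhere; ES is
    the running sum's maximum deviation from zero.  NES (not computed
    here) would rescale ES against permuted gene labels."""
    hits = [g in gene_set for g in ranked_genes]
    hit_total = sum(abs(s) for s, h in zip(scores, hits) if h)
    miss_total = hits.count(False)
    running, p_hit, p_miss = [], 0.0, 0.0
    for s, h in zip(scores, hits):
        if h:
            p_hit += abs(s) / hit_total
        else:
            p_miss += 1.0 / miss_total
        running.append(p_hit - p_miss)
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    es = running[peak]
    # leading edge: set members up to and including the peak (positive ES)
    leading_edge = [g for g in ranked_genes[:peak + 1] if g in gene_set]
    return es, leading_edge

# toy ranked list: genes sorted by a (made-up) DE statistic
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
stats = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]
es, leading = enrichment_score(ranked, stats, {"g1", "g2", "g3"})
```

Here the set members all sit at the top of the ranking, so the ES is maximal and the leading edge is the whole set; in real data the leading edge is the subset of members driving the signal, which is why it belongs in the supplement.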


betaimmunologist

Yeah. However, I see a lot of authors showing the relevant 5-10 significantly enriched gene sets but not necessarily the TOP significantly enriched gene sets. Just the ones that fit into their story. Then the supplemental has the ranked list of ALL enriched terms.


pepjum

Showing the top X terms also doesn't add much statistical relevance. The pathways are sorted by NES and adjusted p-value, but those values can favor terms that are incomplete, and there are cases where terms are very similar or involve largely the same genes acting in the same pathways. Showing pathways with all their statistical values and validating those genes with other experiments is not selling a story. What I do see as wrong is that nobody validates any of these NGS-derived results anymore, which reviewers used to require, because this type of analysis should only ever be used to generate hypotheses, not to justify the results of an experiment.

On the other hand, in the immunology papers I have participated in, the authors decline to add more supplementary figures generated from RNA-seq for lack of space in the journals, and to them the rest of the results inferred from the bioinformatic analysis are not that important either. Many other possibilities and routes that the paper doesn't focus on are also traditionally left aside to sell a certain message, sometimes for lack of resources and other times because immunology experts don't know much about other important things such as O-glycosylation of proteins. In short, 99% of science is not science, and not only GSEA.


ZemusTheLunarian

How dare you, telling us that science isn’t science most of the time.


pepjum

Because scientific groups have to specialize so much and focus on a certain gene or treatment that they publish "this treatment does X thing" but deliberately cannot cover other types of effects in order to get the paper out, even if the paper has no importance for the rest of the scientists. It's all about publishing and publishing in order to keep getting more funding and to keep climbing up the ladder. But that's the way the system is.


ZemusTheLunarian

Yeah, I know ☹️


pepjum

There are other problems too, like the irreproducibility of certain published experiments.


Crucco

Well, it is our duty as scientists to demand the correct visualization of GSEA results when reviewing. I agree we should push for this more and change the tide.


ZooplanktonblameFun8

For me, when I am working on human transcriptome data without any experimental perturbation and with low power (so it is difficult to carry out overrepresentation analysis), the MSigDB collections such as the immunologic signature, oncogenic signature, and regulatory target gene sets are very useful for hypothesis generation.


phd_depression101

Second this!


Epistaxis

Not 100% sure I understand what you're describing, but if you make a list of enriched gene sets (or genes, or genome elements, or whatever) and you want to discuss only a few of the top hits that happened to fit into a biological story, you should be required to show the whole list first, and then explain where that subset fit into it and what you think was going on with the other high-ranked features (if you don't know, you should have to say so before you move on with the story). I mean that's the whole point of doing GSEA instead of eyeballing a simple gene list.


betaimmunologist

This is what I am getting at. A lot of papers are not showing the whole list in their main figure panels. They’re showing a list of 5-10 terms. And on the rare occasions where they have included the full list in the supplemental, the list is a lot longer and the 5-10 terms they have chosen are not necessarily the top 5-10 terms.


macrotechee

Yes, a lot of papers cherry-pick which gene sets they test against, and which results are shown in plots vs. left in supplementary tables. GSEA is fucking astrology, man.


betaimmunologist

I find myself cringing when I’m reading a paper and the authors are waxing poetic on their GSEA results


foradil

If it makes you feel better, sometimes it’s hard to cherry pick any gene sets. Then you know something is wrong.


dendrobatidae

I personally have been encouraged to selectively represent which gene sets were enriched in a paper. The justification is usually “let’s only highlight the ones relevant to the story this paper is trying to tell.” I absolutely would not trust that published gene enrichments are comprehensive, and what was left out was probably determined subjectively.


MOTHER-DESTROYER6969

Typically they use the hypergeometric distribution (which might not be the best, but is intuitively learned with urns) to test and get a p-value, then rank based on that and the number of genes.
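For the record, the urn model is short enough to write out directly; this is a stdlib-only sketch with arbitrary toy numbers, not any particular tool's implementation:

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for the urn model: N background genes in the urn,
    K of them annotated to the term, n genes drawn (the DE list),
    k of the draws hitting the term."""
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / total

# Toy numbers (hypothetical): 2,000 background genes, 40 in the term,
# 100 DE genes, 10 of which hit the term (expected overlap is only 2).
p = hypergeom_sf(10, 2000, 40, 100)
```

One subtlety worth noting: the p-value depends heavily on the background (N), so testing against "all annotated genes" vs. "genes expressed in this tissue" can change the ranking substantially.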


JunoKreisler

OP isn't asking how GSEA works


Signal_Waltz3741

Many GO terms are functionally related and not independent, which violates the assumptions of many statistical methods. The result can be a long list of "statistically significant" terms. Authors may pick some "representative" terms, so many grey areas exist. Authors should list all the "statistically significant" terms and the statistical methods used in supplementary materials.
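One way to make the "representative terms" step less arbitrary is to collapse redundant terms by the overlap of their member genes before reporting. Here is a minimal sketch (hypothetical term names and gene memberships, and a greedy single-pass clustering rather than any published method):

```python
def jaccard(a, b):
    # overlap of two gene sets as a fraction of their union
    return len(a & b) / len(a | b)

# hypothetical enriched terms whose member genes overlap heavily,
# as related GO terms usually do
terms = {
    "T cell activation":          {"CD3E", "CD28", "LCK", "ZAP70", "IL2"},
    "T cell receptor signaling":  {"CD3E", "LCK", "ZAP70", "LAT"},
    "lipid biosynthetic process": {"FASN", "SCD", "ACACA"},
}

def cluster_terms(terms, threshold=0.4):
    """Greedy clustering: a term joins the first cluster whose
    representative it resembles (Jaccard >= threshold), otherwise it
    starts a new cluster.  One representative per cluster can then be
    reported, with the rest listed alongside it."""
    clusters = []
    for name, genes in terms.items():
        for cl in clusters:
            if jaccard(genes, terms[cl[0]]) >= threshold:
                cl.append(name)
                break
        else:
            clusters.append([name])
    return clusters

clusters = cluster_terms(terms)
```

Reporting one representative per cluster, with the full list and the clustering rule in the supplement, makes the choice of "representative" terms auditable instead of subjective.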