Normalization/Log2 transformation requirements

This issue has been tracked since 2021-10-05.

Hello, and thanks for developing this model. I read in the supplemental materials that the G x S matrix for RNA-seq data should be filtered for low counts, normalized, and log2 transformed before running the model. The materials suggest RPKM and TPM for the normalization; however, I would like to use upper-quantile normalized counts generated by RUVg so that I can easily make use of my spike-ins. Will this be a problem? So far I have filtered out low-count genes, extracted the normalized counts from RUVg, log2 transformed them, and rounded them to integers. I want to be sure that I am understanding correctly and that my normalization procedure checks out (and also that I'm not over-normalizing).

Thanks!

davidsebfischer wrote this answer on 2021-10-06

Hi @mallorymaynes, ImpulseDE2 uses a negative binomial noise model, which comes with assumptions about the data distribution and is built for count (i.e. non-normalised, non-logged, integer) data. This type of statistical modelling still works if your data transform does not violate the count data structure too much; taking logs, for example, will most likely cause major issues.
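
As a rough illustration of why logging is a problem, the base-R sketch below simulates overdispersed counts for a single gene and applies the log2-then-round transform described in the question. The simulated parameters and the +1 pseudocount are assumptions made only for this example; nothing here calls ImpulseDE2.

```r
# Simulated overdispersed counts for one gene (values are made up).
set.seed(1)
raw_counts <- rnbinom(n = 1000, mu = 100, size = 5)

# The transform described in the question: log2, then round to integers.
# The +1 pseudocount is an assumption added here to avoid log2(0).
logged_counts <- round(log2(raw_counts + 1))

# A negative binomial has variance = mu + mu^2 / size, so variance >= mean.
# After logging and rounding, the variance falls below the mean, which a
# negative binomial noise model cannot represent.
dispersion_summary <- data.frame(
  data = c("raw", "log2_rounded"),
  mean = c(mean(raw_counts), mean(logged_counts)),
  variance = c(var(raw_counts), var(logged_counts))
)
print(dispersion_summary)
```

The rounded values are still integers, but they no longer behave like counts, which is the sense in which the transform breaks the count data structure.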

Assuming that your transforms don't change the statistics too much, it may work, but it would be better to use count data and to supply size factors to scale the model. Filtering genes does not affect the model fits of the other genes if you define size factors.
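
A minimal sketch of that suggestion, under stated assumptions: the counts are simulated, and the size factors are computed with a plain base-R median-of-ratios estimator (the approach popularised by DESeq), which is one possible choice and not necessarily what ImpulseDE2 does internally. The helper names `estimate_size_factors` and `run_with_size_factors` are hypothetical. The wrapper at the end is defined but not called, and it assumes the `vecSizeFactorsExternal` argument of `runImpulseDE2` accepts a numeric vector named by sample; check the package documentation before relying on it.

```r
set.seed(2)
n_genes <- 200
sample_names <- paste0("S", 1:6)

# Simulated raw counts; 'depth' mimics different sequencing depths per sample.
depth <- c(0.5, 1, 1.5, 0.8, 1.2, 2)
mat_counts <- sapply(depth, function(d) rnbinom(n_genes, mu = 50 * d, size = 10))
dimnames(mat_counts) <- list(paste0("gene", seq_len(n_genes)), sample_names)

# Median-of-ratios size factors computed on the raw (unlogged) counts.
estimate_size_factors <- function(counts) {
  log_counts <- log(counts)
  log_geo_means <- rowMeans(log_counts)
  usable <- is.finite(log_geo_means)  # drop genes with a zero in any sample
  apply(log_counts[usable, , drop = FALSE], 2, function(sample_col) {
    exp(median(sample_col - log_geo_means[usable]))
  })
}

vec_size_factors <- estimate_size_factors(mat_counts)
print(round(vec_size_factors, 2))

# Not run here: needs ImpulseDE2 installed and a matching dfAnnotation.
run_with_size_factors <- function(mat_counts, df_annotation, size_factors) {
  ImpulseDE2::runImpulseDE2(
    matCountData = mat_counts,
    dfAnnotation = df_annotation,
    boolCaseCtrl = FALSE,
    vecSizeFactorsExternal = size_factors
  )
}
```

Because the size factors are fixed per sample before fitting, removing low-count genes afterwards does not change the scaling applied to the remaining genes.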

mallorymaynes wrote this answer on 2021-10-06

Thank you, this is very helpful. It sounds like I should instead use my raw counts and include the estimated factors of unwanted variation generated by RUVg - is that what you mean by supplying factors to scale the model?

mallorymaynes wrote this answer on 2021-10-08

Hi David, I am still a little confused about how to input my RUVseq factors of unwanted variation into ImpulseDE2. Specifically, the output for RUVseq (called "W_1") is used as a covariate in DESeq2 or edgeR models, such that the full model for a time course in DESeq2 would be "~ W_1 + time + treatment + treatment:time," and the reduced would be: "~ W_1 + treatment + time." Given this, how do I correctly integrate W_1 into ImpulseDE2? Would this be considered vecConfounders, size factors, or something I can integrate in the dfAnnotation? Thanks for your help, it is much appreciated!
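
For reference, the two DESeq2-style designs above can be written as plain R formulas and compared to see which term the full-versus-reduced test targets. This is only a base-R illustration of the formulas quoted in this post; it does not call DESeq2, edgeR, or ImpulseDE2.

```r
# Full and reduced designs as quoted above.
full_model <- ~ W_1 + time + treatment + treatment:time
reduced_model <- ~ W_1 + treatment + time

# Terms present in the full model but absent from the reduced model.
tested_terms <- setdiff(
  attr(terms(full_model), "term.labels"),
  attr(terms(reduced_model), "term.labels")
)
print(tested_terms)
```

W_1 appears in both designs, so it acts as an adjustment covariate and is not itself the tested effect.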

davidsebfischer wrote this answer on 2021-10-14

This would be an element of vecConfounders, which essentially builds a model that works like the "+" notation in DESeq!
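
A hedged sketch of what that could look like: the annotation table below is simulated, the W_1 values are random placeholders standing in for the RUVg output, and the helper `run_with_confounder` is hypothetical and is defined but not called. It assumes `dfAnnotation` uses `Sample`, `Condition`, and `Time` columns and that `vecConfounders` names further columns of `dfAnnotation`. One caveat to verify: to my understanding the package documentation describes confounders as batch labels, so how a continuous covariate such as W_1 is treated should be checked before use.

```r
set.seed(3)
time_points <- c(0, 2, 4, 8)

# Simulated sample annotation: 2 conditions x 4 time points x 2 replicates.
df_annotation <- data.frame(
  Sample = paste0("S", 1:16),
  Condition = rep(c("case", "control"), each = 8),
  Time = rep(rep(time_points, each = 2), times = 2),
  stringsAsFactors = FALSE
)
rownames(df_annotation) <- df_annotation$Sample

# Placeholder for the RUVg factor of unwanted variation (random values here).
df_annotation$W_1 <- rnorm(nrow(df_annotation))

# Confounders are referenced by column name in the annotation table.
vec_confounders <- c("W_1")

# Not run here: needs ImpulseDE2 installed and a raw count matrix whose
# column names match df_annotation$Sample.
run_with_confounder <- function(mat_counts, df_annotation, confounders) {
  ImpulseDE2::runImpulseDE2(
    matCountData = mat_counts,
    dfAnnotation = df_annotation,
    boolCaseCtrl = TRUE,
    vecConfounders = confounders
  )
}
```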

More Details About Repo
Owner Name YosefLab
Repo Name ImpulseDE2
Full Name YosefLab/ImpulseDE2
Language R
Created Date 2016-04-06
Updated Date 2021-10-15
Star Count 21
Watcher Count 7
Fork Count 5
Issue Count 20
