UMI collapse error

Hello flow team,

I am trying to run the CLIP-seq pipeline on a file that contains two replicates.
All the steps run fine up until the UMI_collapse stage on the “multi.Aligned.toGenome_sorted.out.bam” file of each replicate.

I had a look at the process execution log and I think there is a Java memory issue:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects

Is this the case? If so, would there be an option to override this so the files can be processed successfully?

Thank you and best wishes,
Fiona

Hi Fiona,

I see - this might be because the multi-mapped BAM is very large.

I’ve just pushed an update to CLIP-Seq v1.1 that might help here, so please try it again. Also, if you don’t require the multi-mapped files, you can ignore the error and everything else should still be produced fine. Is that the case, or are other steps also not being run?

There are other things we can try to help get these large BAMs through this step, so let me know if it happens again when you rerun this file.
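For reference, the heap ceiling for this Java-based step is the -Xmx setting on the command line. If you were ever to run it by hand outside Flow, raising it would look something like this (a sketch only - the jar path and file names are illustrative, not Flow’s actual paths):

```
# Illustrative only: raise the JVM heap ceiling for the UMI collapse step.
# The jar path and BAM names are placeholders, not Flow's actual paths.
java -Xmx64g -jar umicollapse.jar bam \
    -i multi.Aligned.toGenome_sorted.out.bam \
    -o multi.dedup.bam
```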

One other thing: you say the file contains two replicates - have you demultiplexed it first? I’m not clear what you mean by this.

All best!
Charlotte

Hi Charlotte,

Thank you for your swift reply. I will send it through again now and hope the update helps!
Annoyingly it is happening at both the multi-mapped stage (which I don’t need) and this stage (which I do): CLIPSEQ:GENOME_UNIQUE_DEDUP:UMICOLLAPSE

And yes, the sample has been demultiplexed. I meant that the fastq I uploaded contained two replicates, which were successfully demultiplexed; I then added both of these to the CLIP-seq execution. I was wondering if running each replicate as a separate CLIP-seq execution could also help?

Thank you!
Fiona

Hi Fiona,

Running them separately shouldn’t help, so do let me know if the update works - otherwise we’ll look at other solutions!

Best,
Charlotte

Hi Charlotte,
Unfortunately it is still failing at the same steps (and now at the filter transcripts step too), so maybe the files are too large.

We were thinking we could split the fastq in half or something, but if there is an alternative through Flow that we could try, that would be great!

Thank you,
Fiona

Hi,

Sorry - I introduced an error in the filter transcripts step while trying to fix something else; that part should be fine now.

On the large BAM failing: we can expose other UMI collapse methods via the user input, so you could try a simpler algorithm that’s less RAM-intensive. If you’re able to share one of the failing samples with me (on Flow), I can also try running it on HPC to experiment with the settings and see what works, then implement that on Flow.
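As a concrete example, UMICollapse itself has options that trade speed for memory. These flags are from my recollection of its README rather than the Flow defaults, so do check --help, but the idea would be a simpler clustering algorithm plus its two-pass mode, which avoids holding every alignment in memory at once:

```
# A sketch of a lower-memory UMICollapse run; flags recalled from the
# UMICollapse README, file names illustrative.
# --algo cc: connected-components clustering, simpler than directional
# --two-pass: avoids keeping all alignments in memory at once
java -Xmx32g -jar umicollapse.jar bam \
    -i sample.Aligned.toGenome_sorted.out.bam \
    -o sample.dedup.bam \
    --algo cc --two-pass
```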

Best,
Charlotte

No worries at all, hope it was easily fixed!

I have just shared the project with you on Flow, so hopefully you can now see and edit it - let me know if not. It is called “tech_hnRNPC”.
Thank you for investigating this!
Fiona

Hi Fiona,

I’ve released a new CLIP-Seq pipeline version 1.2 with an updated UMICOLLAPSE module from nf-core that handles high memory requirements better. With this update the uniquely mapped BAM file is deduplicated, but the multi-mapped one still fails. PEKA and iCount peak calling are still running - both tools are computationally complex and weren’t designed for such large datasets, but I’ll keep watching; maybe they’ll finish! You can see here: Flow

I’d note that your files are exceptionally large - 100M reads is very deep for CLIP, and the UMI length is also longer than anything we’ve processed before; both of these factors impact the compute dramatically. It might be that further optimisations are required, and we’d be happy to implement them once you have them figured out.

Best,
Charlotte

Hi Charlotte,
Great, thank you. It looks like it has processed everything except the multi-mapped data, which we don’t technically need for our analysis.
Yes, I agree these files are far larger than most CLIP datasets, and I doubt we will need to process such large files again - this is an outlier in our analysis. I think it may have been better to split the fastq into multiple parts first…
Thank you for your help!
Fiona

Hi

That’s good to hear that you have what you need. The problem with splitting the fastq at the start is that you might split PCR duplicates across separate files when they need to be collapsed together. The BAMs could instead be split up by chromosome, for example, after they’ve been mapped, because PCR duplicates are collapsed partly based on mapping position.
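Roughly, that per-chromosome approach would look like this with samtools (a sketch only - file names are illustrative, the dedup command is a placeholder for whatever UMI collapse step is used, and multi-mapped reads with alignments on several chromosomes would need extra care):

```
# Sketch: split a coordinate-sorted, indexed BAM by chromosome, deduplicate
# each piece independently, then merge. File names are illustrative.
samtools index sample.sorted.bam
for chrom in $(samtools idxstats sample.sorted.bam | cut -f1 | grep -v '^\*'); do
    samtools view -b sample.sorted.bam "$chrom" > "$chrom.bam"
    # position-based dedup never needs reads from two chromosomes at once,
    # so each piece can be collapsed on its own (placeholder command):
    java -jar umicollapse.jar bam -i "$chrom.bam" -o "$chrom.dedup.bam"
done
samtools merge -f sample_dedup_merged.bam *.dedup.bam
```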

All best,
Charlotte

Hi Charlotte,
This is a very good point - we did not think of this!
OK, maybe next time we can map manually and split the BAMs…
I’m sure there’s a way around things,

Best wishes,
Fiona
