UMI collapse error

Hello flow team,

I am trying to run the CLIP-seq pipeline on a file that contains two replicates.
All the steps run fine up until the UMI_collapse stage on the “multi.Aligned.toGenome_sorted.out.bam” file of each replicate.

I had a look at the process execution log and I think there is a Java memory issue:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects

Is this the case? If so, would there be an option to override this so the files can be processed successfully?

Thank you and best wishes,
Fiona

Hi Fiona,

I see - this might be because the multi-mapped BAM is very large.

I’ve just pushed an update to CLIP-Seq v1.1 that might help here, so please try it again. Also, if you don’t require the multi-mapped files, you can ignore the error and everything else should still be produced fine. Is that the case, or are other steps also not being run?

There are other things we can try to help get these large BAMs through this step, so let me know if it happens again when you rerun this file.
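For reference, the heap ceiling for this Java-based step is the -Xmx setting on the command line. If you were ever to run it by hand outside Flow, raising it would look something like this (a sketch only - the jar path and file names are illustrative, not Flow’s actual paths):

```
# Illustrative only: raise the JVM heap ceiling for the UMI collapse step.
# The jar path and BAM names are placeholders, not Flow's actual paths.
java -Xmx64g -jar umicollapse.jar bam \
    -i multi.Aligned.toGenome_sorted.out.bam \
    -o multi.dedup.bam
```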

One other thing: you say the file contains two replicates - have you demultiplexed it first? I’m not clear what you mean by this.

All best!
Charlotte

Hi Charlotte,

Thank you for your swift reply. I will send it through again now and hope the update helps!
Annoyingly it is happening at both the multi-mapped stage (which I don’t need) and this stage (which I do): CLIPSEQ:GENOME_UNIQUE_DEDUP:UMICOLLAPSE

And yes, the sample has been demultiplexed. I meant that the fastq I uploaded contained two replicates, which were successfully demultiplexed; I then added both of these to the CLIP-seq execution. I was wondering if running each replicate as a separate CLIP-seq execution could also help?

Thank you!
Fiona

Hi Fiona,

Running them separately shouldn’t help, so do let me know if the update works - otherwise we’ll look at other solutions!

Best,
Charlotte

Hi Charlotte,
Unfortunately it is still failing at the same steps (and now at the filter transcripts step too), so maybe the files are too large.

We were thinking we could split the fastq in half or something, but if there is an alternative through Flow that we could try, that would be great!

Thank you,
Fiona

Hi,

Sorry - I introduced an error in the filter transcripts step while trying to fix something else; that part should be fine now.

On the large BAM failing: we can expose other UMI collapse methods via the user input, so you could try a simpler algorithm that’s less RAM-intensive. If you’re able to share one of the failing samples with me (on Flow), I can also try running it on HPC to experiment with the settings and see what works, then implement that on Flow.
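As a concrete example, UMICollapse itself has options that trade speed for memory. These flags are from my recollection of its README rather than the Flow defaults, so do check --help, but the idea would be a simpler clustering algorithm plus its two-pass mode, which avoids holding every alignment in memory at once:

```
# A sketch of a lower-memory UMICollapse run; flags recalled from the
# UMICollapse README, file names illustrative.
# --algo cc: connected-components clustering, simpler than directional
# --two-pass: avoids keeping all alignments in memory at once
java -Xmx32g -jar umicollapse.jar bam \
    -i sample.Aligned.toGenome_sorted.out.bam \
    -o sample.dedup.bam \
    --algo cc --two-pass
```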

Best,
Charlotte

No worries at all, hope it was easily fixed!

I have just shared the project with you on Flow, so hopefully you can now see and edit it - let me know if not. It is called “tech_hnRNPC”.
Thank you for investigating this!
Fiona

Hi Fiona,

I’ve released a new CLIP-Seq pipeline version 1.2 with an updated UMICOLLAPSE module from nf-core that handles high memory requirements better. With this update the uniquely mapped BAM file is deduplicated, but the multi-mapped one still fails. PEKA and iCount peak calling are still running - both tools are computationally complex and weren’t designed for such large datasets, but I’ll keep watching; maybe they’ll finish! You can see here: Flow

I’d note that your files are exceptionally large - 100M reads is very deep for CLIP, and the UMI length is also longer than anything we’ve processed before; both of these factors impact the compute dramatically. It might be that further optimisations are required, and we’d be happy to implement them once you have them figured out.

Best,
Charlotte

Hi Charlotte,
Great, thank you. It looks like it has processed everything except the multi-mapped data, which we don’t technically need for our analysis.
Yes, I agree these files are far larger than most CLIP datasets, and I doubt we will need to process such large files again - this is an outlier in our analysis. I think it may have been better to split the fastq into multiple parts first…
Thank you for your help!
Fiona

Hi

That’s good to hear that you have what you need. The problem with splitting the fastq at the start is that you might split PCR duplicates across separate files when they need to be collapsed together. The BAMs could instead be split up by chromosome, for example, after they’ve been mapped, because PCR duplicates are collapsed partly based on mapping position.
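Roughly, that per-chromosome approach would look like this with samtools (a sketch only - file names are illustrative, the dedup command is a placeholder for whatever UMI collapse step is used, and multi-mapped reads with alignments on several chromosomes would need extra care):

```
# Sketch: split a coordinate-sorted, indexed BAM by chromosome, deduplicate
# each piece independently, then merge. File names are illustrative.
samtools index sample.sorted.bam
for chrom in $(samtools idxstats sample.sorted.bam | cut -f1 | grep -v '^\*'); do
    samtools view -b sample.sorted.bam "$chrom" > "$chrom.bam"
    # position-based dedup never needs reads from two chromosomes at once,
    # so each piece can be collapsed on its own (placeholder command):
    java -jar umicollapse.jar bam -i "$chrom.bam" -o "$chrom.dedup.bam"
done
samtools merge -f sample_dedup_merged.bam *.dedup.bam
```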

All best,
Charlotte

Hi Charlotte,
This is a very good point - we did not think of this!
OK, maybe next time we can map manually and split the BAMs…
I’m sure there’s a way around things,

Best wishes,
Fiona
