Low unique reads count Arabidopsis CLIP-seq

RicardoFerraz · 15 October 2025 15:34

Dear Flow.Bio team,

I’m running the CLIP-seq pipeline on Arabidopsis crosslinked (R1 and R2) and non-crosslinked (noUV_R1) samples, plus a control sample without the specific antibody, for which I used a non-specific IgG antibody instead (IgG_R1).

First, I prepared the genome for the CLIP pipeline: I uploaded the genome .fasta sequence, the .gtf annotation file and the ncRNA .fasta sequence.

In the .gtf file, I had to remove every attribute in column 9 other than gene_id, transcript_id, exon_number, gene_biotype and transcript_biotype in order to avoid this error in ICOUNT:
"[ValueError] need more than 1 value to unpack

in the context of pybedtools.cbedtools.Attributes.init suggests that a malformed attribute field is being encountered in your GTF file.",

according to:

github.com/10XGenomics/cellranger

CellRanger-ARC for Arbidosis failed with error

opened 07:55PM - 02 Jun 23 UTC

closed 11:13PM - 06 Nov 23 UTC

nilesh-iiita

I am running CellRanger-ARC for Arabidopsis and I am getting the following error…:

```
2023-06-02 14:40:02 [runtime] (failed) ID.D3.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._SC_ATAC_GEX_ANALYZER._PEAK_ANNOTATOR.ANNOTATE_PEAKS
[error] Pipestance failed. Error log at:
D3/SC_ATAC_GEX_COUNTER_CS/SC_ATAC_GEX_COUNTER/_SC_ATAC_GEX_ANALYZER/_PEAK_ANNOTATOR/ANNOTATE_PEAKS/fork0/join-uc5dd7a4574/_errors
Log message:
Traceback (most recent call last):
  File "/data/rc/apps/rc/software/CellRanger-ARC/2.0.0/external/martian/adapters/python/martian_shell.py", line 659, in _main
    stage.main()
  File "/data/rc/apps/rc/software/CellRanger-ARC/2.0.0/external/martian/adapters/python/martian_shell.py", line 629, in main
    lambda: self._module.join(args, outs, chunk_defs, chunk_outs)
  File "/data/rc/apps/rc/software/CellRanger-ARC/2.0.0/external/martian/adapters/python/martian_shell.py", line 589, in _run
    cmd()
  File "/data/rc/apps/rc/software/CellRanger-ARC/2.0.0/external/martian/adapters/python/martian_shell.py", line 629, in <lambda>
    lambda: self._module.join(args, outs, chunk_defs, chunk_outs)
  File "/data/rc/apps/rc/software/CellRanger-ARC/2.0.0/mro/atac/stages/analysis/annotate_peaks/init.py", line 105, in join
    gene_id_name_map = build_gene_id_name_map(ref_mgr)
  File "/data/rc/apps/rc/software/CellRanger-ARC/2.0.0/mro/atac/stages/analysis/annotate_peaks/init.py", line 60, in build_gene_id_name_map
    fields.attrs.get("gene_name", fields.attrs["gene_id"])
  File "pybedtools/cbedtools.pyx", line 392, in pybedtools.cbedtools.Interval.attrs.get
  File "pybedtools/cbedtools.pyx", line 180, in pybedtools.cbedtools.Attributes.init
ValueError: need more than 1 value to unpack
```

I have tried the following to resolve the issue: I have updated CellRanger-ARC to the latest version. I have checked the input data to make sure that it is in the correct format. I have checked the software used to run CellRanger-ARC to make sure that it is up to date and that it has all of the necessary dependencies.
I am still unable to resolve the issue. Please help me to troubleshoot this issue. Thank you.

I did the same with the Solanum tuberosum genome.
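For anyone who needs to repeat this clean-up, the attribute filtering described above can be sketched in Python roughly as follows. This is a minimal sketch, not the exact procedure used for these genomes: the function name `filter_attributes` and the regular expression are illustrative assumptions, and it assumes a standard 9-column, tab-separated GTF in which attributes are written as `key "value";` or `key value;`.

```python
import re

# Attribute keys to keep, as listed in the post above.
KEEP = ("gene_id", "transcript_id", "exon_number", "gene_biotype", "transcript_biotype")

# Matches `key "value"` or `key value` pairs in GTF column 9.
# Quoted values may themselves contain semicolons or spaces.
ATTR_RE = re.compile(r'([A-Za-z_][\w.]*)\s+("[^"]*"|[^;\s]+)')


def filter_attributes(line, keep=KEEP):
    """Return a GTF line whose column 9 holds only the attributes in `keep`."""
    if line.startswith("#") or not line.strip():
        return line  # comments and blank lines pass through unchanged
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    kept = []
    for key, raw_value in ATTR_RE.findall(fields[8]):
        if key in keep:
            kept.append(f"{key} {raw_value};")  # value kept exactly as written
    fields[8] = " ".join(kept)
    return "\t".join(fields) + "\n"


def filter_gtf(in_path, out_path, keep=KEEP):
    """Write a copy of the GTF at `in_path` with filtered attributes to `out_path`."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(filter_attributes(line, keep))
```

Keys that appear without a value are dropped by this sketch, since the pattern only matches complete key/value pairs. It is worth inspecting a few output lines by eye before uploading the filtered file.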

However, when I ran the CLIP-seq pipeline, the Solanum samples gave an adequate number of unique reads (> 1M) but the Arabidopsis samples did not: their total number of reads didn’t even reach 1M. Furthermore, the downstream CLIP-seq analyses in the Arabidopsis pipeline did not produce results for one of the crosslinked samples. For example, the R1 sample didn’t yield thresholded sites.

The antibody was raised against a Solanum betaceum peptide synthesised from our cDNA sequence of interest. It was tested in Western blot and immunoprecipitation analyses with protein extracts from both species. Some non-specific bands were observed, although the pattern was similar in both species.

So, the question is: to explain the low unique read count in the Arabidopsis results, do you think that the Arabidopsis genome preparation, or the fasta or gtf files themselves, could contain mistakes or features that subsequently interfere with this count? Or could it be explained by low specificity of the antibody? Or is there another explanation?

Is there any procedure to verify the quality of the genome preparation and/or the uploaded genome files? If so, how can I fix a low-quality genome preparation or low-quality files?

Thank you,

Ricardo Ferraz

Charlotte · 17 October 2025 15:32

Hi Ricardo

Sounds like a very cool project! You could try analysing a published dataset to see whether you can reproduce those results. That would rule out any problem with the bioinformatics. I know there are some Arabidopsis iCLIP samples here: GEO Accession viewer. That being said, the only time I’ve ever had a problem like that was when I had some weird line endings in my fasta, because it was a custom genome produced from a file that had been saved in Windows at some point. If you have any expected targets or binding sites, you can check those in the genome browser. You can also load your fasta and gtf in IGV or something similar to check that they make sense.
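The line-ending problem mentioned above can be checked with a few lines of Python. This is only a sketch under simple assumptions: the function name `count_line_endings` is made up for illustration, and the file is read in binary mode so that Python does not silently translate the endings.

```python
def count_line_endings(path):
    """Count Unix (LF), Windows (CRLF) and stray carriage-return (CR) endings."""
    counts = {"LF": 0, "CRLF": 0, "CR": 0}
    with open(path, "rb") as handle:  # binary mode: no newline translation
        for line in handle:  # binary iteration splits on b"\n" only
            if line.endswith(b"\r\n"):
                counts["CRLF"] += 1
                body = line[:-2]
            elif line.endswith(b"\n"):
                counts["LF"] += 1
                body = line[:-1]
            else:
                body = line  # final line without a trailing newline
            # Any remaining "\r" is a bare carriage return (old Mac style).
            counts["CR"] += body.count(b"\r")
    return counts
```

A file written entirely on Linux or macOS should report only `LF` lines. Non-zero `CRLF` or `CR` counts suggest the file passed through Windows or an older editor and may need converting before upload.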

One thing I noticed is that your Arabidopsis samples seem to have more “unmapped too short” reads. Could the RNase treatment have been too strong? You can check the trimmed read lengths and the original gels from optimising the RNase treatment; this is a common problem in CLIP.
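One way to look at the trimmed read lengths is to tabulate them directly from the trimmed FASTQ file. The sketch below assumes standard 4-line FASTQ records (optionally gzipped); the function names and the length cutoff passed to `fraction_shorter_than` are illustrative choices, not pipeline parameters.

```python
import gzip
from collections import Counter


def read_length_counts(fastq_path):
    """Return a Counter mapping read length to number of reads."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lengths = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of each 4-line record is the sequence
                lengths[len(line.rstrip("\r\n"))] += 1
    return lengths


def fraction_shorter_than(lengths, min_len):
    """Fraction of reads whose length is below `min_len` (0.0 for an empty file)."""
    total = sum(lengths.values())
    short = sum(n for length, n in lengths.items() if length < min_len)
    return short / total if total else 0.0
```

Comparing this fraction between the Arabidopsis and Solanum samples, with the same cutoff for both, would show whether the Arabidopsis inserts are systematically shorter after trimming.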

Another suggestion is to look at the pre-mapping to tRNA and rRNA with Bowtie - do you lose a lot of reads here? If the pipeline got there, you can find a summary in the output of the MERGED_SUMMARY process that includes this pre-mapping as a line.

Hope this helps,
Charlotte

RicardoFerraz · 20 November 2025 17:58

Dear @Charlotte,

I am still trying the analysis with published data.

However, I followed your advice about the pre-mapping to tRNA and rRNA and saw that, while in the Solanum samples I only lost 9.27% of the cDNA (141471 in cDNA#), in Arabidopsis I lost 94.4% of the cDNA (157809 in cDNA#). These reads are not present in the respective ICOUNT_SUMMARY. What do you think causes this difference, and what can I do to overcome this problem?

Charlotte · 24 November 2025 13:37

Sorry, I don’t think I understand. When you look at the MERGED_SUMMARY output (not ICOUNT_SUMMARY) you will have a line for premapped ncRNA. Are you saying that for your Arabidopsis analysis, nearly 100% of the RNA premaps to tRNA and rRNA?

RicardoFerraz · 24 November 2025 17:37

Dear @Charlotte,

Thank you for replying.

So, with the Arabidopsis samples, this is what I see in UV_R1.summary_type_premapadjusted.tsv:

Type	Length	cDNA #	cDNA %
CDS	33844720	1131	0.67647991195593
UTR3	7387480	417	0.24941832297579383
UTR5	5922612	274	0.16388638008481418
ncRNA	2089460	84	0.050242539880016035
intron	18497133	896	0.5359204253868376
intergenic	171592741	6578	3.9344693729850646
premapped rRNA_tRNA	NA	157809	94.38958304673154

While in R1.summary_gene.tsv I see this:

Type	Length	cDNA #	cDNA %
CDS	33844720	1131	12.057569296375267
UTR3	7387480	417	4.445628997867804
UTR5	5922612	274	2.9211087420042645
ncRNA	2089460	84	0.8955223880597015
intron	18497133	896	9.55223880597015
intergenic	171592741	6578	70.12793176972282

With the Solanum samples, I see this in UV_R1.summary_type_premapadjusted.tsv:

Type	Length	cDNA #	cDNA %
CDS	36784647	75794	4.9648795397362635
UTR3	10597805	13639	0.8934215378850953
UTR5	7009625	19100	1.2511438795810044
ncRNA	222626	550	0.036027703338720025
intron	62954667	111156	7.281264349670477
intergenic	1503738722	1164893	76.30621713700288
premapped rRNA_tRNA	NA	141471	9.267045852785563

While in Sb_R1.summary_type.tsv I see this:

Type	Length	cDNA #	cDNA %
CDS	36784647	75794	5.471969458506481
UTR3	10597805	13639	0.9846714970125591
UTR5	7009625	19100	1.3789299503585217
ncRNA	222626	550	0.0397074069474967
intron	62954667	111156	8.024939139374442
intergenic	1503738722	1164893	84.0997825478005

What does this mean for the Arabidopsis samples?

Charlotte · 24 November 2025 21:10

The premapadjusted.tsv is the best one to look at because it incorporates the counts from the pre-mapping. It looks to me as though the barcodes probably haven’t been properly accounted for in the Arabidopsis sample, because the overall crosslink count is very low, not even reaching 200,000. I would guess that most reads end up unmapped. The barcode information can be hard to find in a publication/GEO record. If you share the reference with me, I can help you find it. If you’re certain the barcodes are correct in your analysis, then I’m not so sure what the cause is, but something weird is going on.
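To make the relationship between the two summary tables explicit, the percentages can be recomputed from the raw cDNA counts. The sketch below uses the Arabidopsis counts posted above; the function name and the assumption that each percentage is simply count divided by the column total are illustrative and not taken from the pipeline source.

```python
import csv
import io

# Arabidopsis UV_R1 premap-adjusted counts, copied from the table posted above.
ARABIDOPSIS_TSV = (
    "Type\tLength\tcDNA #\n"
    "CDS\t33844720\t1131\n"
    "UTR3\t7387480\t417\n"
    "UTR5\t5922612\t274\n"
    "ncRNA\t2089460\t84\n"
    "intron\t18497133\t896\n"
    "intergenic\t171592741\t6578\n"
    "premapped rRNA_tRNA\tNA\t157809\n"
)


def recompute_percentages(tsv_text):
    """Return (total cDNA count, {type: percentage of that total})."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {row["Type"]: int(row["cDNA #"]) for row in rows}
    total = sum(counts.values())
    return total, {name: 100 * n / total for name, n in counts.items()}


total, percentages = recompute_percentages(ARABIDOPSIS_TSV)
genome_mapped = total - 157809  # cDNAs that map to the genome rather than rRNA/tRNA
print(total, genome_mapped)  # 167189 9380
print(round(percentages["premapped rRNA_tRNA"], 2))  # 94.39
```

Under that assumption the two tables agree: 1131 CDS cDNAs are 0.68% of all 167189 cDNAs but 12.06% of the 9380 genome-mapped ones. Either way, the total of 167189 is well under the 200,000 mentioned above.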
