Low unique read count in Arabidopsis CLIP-seq

Dear Flow.Bio team,

I’m running a CLIP-seq pipeline on Arabidopsis samples: two crosslinked samples (R1 and R2), a non-crosslinked control (noUV_R1), and a control without the specific antibody, for which I used a non-specific IgG antibody instead (IgG_R1).

First, I prepared the genome for the CLIP pipeline. I uploaded the genome .fasta sequence, the .gtf annotation file and the ncRNA .fasta sequence.

In the .gtf file, I had to remove every attribute in column 9 other than gene_id, transcript_id, exon_number, gene_biotype and transcript_biotype in order to avoid this error in ICOUNT:
"[ValueError] need more than 1 value to unpack

in the context of pybedtools.cbedtools.Attributes.__init__ suggests that a malformed attribute field is being encountered in your GTF file.",

according to:

I did the same with the Solanum tuberosum genome.
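
For anyone who wants to reproduce the attribute filtering described above, here is a minimal Python sketch. The list of kept attributes comes from this post. The function names (`filter_attributes`, `filter_gtf`) and the regular expression are illustrative assumptions, not part of the pipeline or of pybedtools. The sketch assumes a standard 9-column, tab-separated GTF and does not handle escaped quotes inside values.

```python
import re

# Attributes kept in column 9, as listed in the post above.
KEEP = ("gene_id", "transcript_id", "exon_number", "gene_biotype", "transcript_biotype")

# One GTF attribute: a key, whitespace, then a quoted or bare value.
# The key may not contain ';' or '"', so stray separators are skipped.
ATTR_RE = re.compile(r'([^\s;"]+)\s+("[^"]*"|[^;\s]+)')


def filter_attributes(line, keep=KEEP):
    """Return a GTF line whose column 9 holds only the attributes in `keep`."""
    if line.startswith("#") or not line.strip():
        return line  # comment and blank lines pass through unchanged
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    kept = [f"{key} {value};" for key, value in ATTR_RE.findall(fields[8]) if key in keep]
    fields[8] = " ".join(kept)
    return "\t".join(fields) + "\n"


def filter_gtf(src_path, dst_path, keep=KEEP):
    """Write a filtered copy of the GTF at src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(filter_attributes(line, keep))
```

Calling `filter_gtf("in.gtf", "out.gtf")` (hypothetical file names) writes the reduced annotation. It is worth diffing a few lines by eye before uploading.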

However, when I ran the CLIP-seq pipeline, the Solanum samples yielded an adequate number of unique reads (> 1M), but the Arabidopsis samples did not: their total number of reads didn’t even reach 1M. Furthermore, the downstream CLIP-seq analyses in the Arabidopsis run did not produce results for one of the crosslinked samples. For example, the R1 sample didn’t yield thresholded sites.


The antibody was raised against a Solanum betaceum peptide synthesised from our cDNA sequence of interest. It was tested in Western blot and immunoprecipitation analyses with protein extracts from both species. Some non-specific bands were observed, although the pattern was similar in both species.

So, the question is: to explain the low unique read count in the Arabidopsis results, do you think the Arabidopsis genome preparation, or the fasta or gtf files themselves, could contain mistakes or features that subsequently interfere with this count? Or could it be explained by low specificity of the antibody? Or is there another explanation?

Is there any procedure to verify the quality of the genome preparation and/or the uploaded genome files? If so, how can a low-quality genome preparation or low-quality files be fixed?

Thank you,

Ricardo Ferraz

Hi Ricardo

Sounds like a very cool project! You could try analysing a published dataset to see whether you can reproduce those results. That would rule out any problem with the bioinformatics. I know there are some Arabidopsis iCLIP samples here: GEO Accession viewer. That being said, the only time I’ve ever had a problem like that was when I had some weird line endings in my fasta, because it was a custom genome produced from a file that had been saved in Windows at some point. If you have any expected targets or binding sites, you can check those in the genome browser. You can also load your fasta and gtf in IGV or similar to check that they make sense.
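
One quick way to check a fasta for Windows line endings is to count the raw line terminators. This is a minimal Python sketch; `count_line_endings` is an illustrative name and not part of any pipeline. It reads the file in binary mode so that Python does not silently normalise the endings.

```python
def count_line_endings(path):
    """Count CRLF, bare LF and stray CR bytes in a file, reading raw bytes."""
    counts = {"crlf": 0, "lf": 0, "cr": 0}
    with open(path, "rb") as handle:
        for line in handle:  # binary mode splits on b"\n" only
            if line.endswith(b"\r\n"):
                counts["crlf"] += 1
                body = line[:-2]
            elif line.endswith(b"\n"):
                counts["lf"] += 1
                body = line[:-1]
            else:
                body = line  # last line without a terminating newline
            counts["cr"] += body.count(b"\r")  # carriage returns inside a line
    return counts
```

A file saved with Unix line endings should report zero for both `crlf` and `cr`. Anything else suggests the file passed through Windows or an old Mac editor and may be worth converting before building the index.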

One thing is that your Arabidopsis samples seem to have more “unmapped too short” reads. Could the RNase treatment have been too strong? You can check the trimmed read lengths and the original gels from optimising the RNase treatment. This is a common problem in CLIP.
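
To look at trimmed read lengths directly, a minimal Python sketch like the following could be run on the trimmed FASTQ files. It assumes standard 4-line FASTQ records, optionally gzipped. The function names and the `min_len` cutoff are illustrative assumptions; the length below which the aligner reports reads as too short depends on its settings.

```python
import gzip
from collections import Counter


def read_length_counts(fastq_path):
    """Return a Counter mapping read length to number of reads."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    lengths = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the 2nd line of each 4-line record
                lengths[len(line.rstrip("\r\n"))] += 1
    return lengths


def fraction_shorter_than(lengths, min_len):
    """Fraction of reads whose length is below min_len (an assumed cutoff)."""
    total = sum(lengths.values())
    short = sum(n for length, n in lengths.items() if length < min_len)
    return short / total if total else 0.0
```

Comparing the output for an Arabidopsis sample and a Solanum sample side by side should show whether the Arabidopsis inserts are systematically shorter.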

Another suggestion is to look at the pre-mapping to tRNA and rRNA with Bowtie - do you lose a lot of reads here? If the pipeline got there, you can find a summary in the output of the MERGED_SUMMARY process that includes this pre-mapping as a line.

Hope this helps,
Charlotte

Dear @Charlotte,

I am still trying the analysis with published data.

However, I followed your advice about the pre-mapping to tRNA and rRNA. In the Solanum samples I only lost 9.27% of the cDNA (141471 in cDNA #), whereas in Arabidopsis I lost 94.4% of the cDNA (157809 in cDNA #); these pre-mapped reads do not appear in the respective ICOUNT_SUMMARY. What do you think causes this difference, and what can I do to overcome the problem?

Sorry, I don’t think I understand. When you look at the MERGED_SUMMARY output (not ICOUNT_SUMMARY), you will have a line for premapped ncRNA. Are you saying that for your Arabidopsis analysis, nearly 100% of the RNA premaps to tRNA and rRNA?

Dear @Charlotte,

Thank you for replying.

So, with the Arabidopsis samples, this is what I see in UV_R1.summary_type_premapadjusted.tsv:

Type                 Length     cDNA #  cDNA %
CDS                  33844720   1131    0.67647991195593
UTR3                 7387480    417     0.24941832297579383
UTR5                 5922612    274     0.16388638008481418
ncRNA                2089460    84      0.050242539880016035
intron               18497133   896     0.5359204253868376
intergenic           171592741  6578    3.9344693729850646
premapped rRNA_tRNA  NA         157809  94.38958304673154

In R1.summary_gene.tsv, I see this:

Type        Length     cDNA #  cDNA %
CDS         33844720   1131    12.057569296375267
UTR3        7387480    417     4.445628997867804
UTR5        5922612    274     2.9211087420042645
ncRNA       2089460    84      0.8955223880597015
intron      18497133   896     9.55223880597015
intergenic  171592741  6578    70.12793176972282

With the Solanum samples, I see this in UV_R1.summary_type_premapadjusted.tsv:

Type                 Length      cDNA #   cDNA %
CDS                  36784647    75794    4.9648795397362635
UTR3                 10597805    13639    0.8934215378850953
UTR5                 7009625     19100    1.2511438795810044
ncRNA                222626      550      0.036027703338720025
intron               62954667    111156   7.281264349670477
intergenic           1503738722  1164893  76.30621713700288
premapped rRNA_tRNA  NA          141471   9.267045852785563

In Sb_R1.summary_type.tsv, I see this:

Type        Length      cDNA #   cDNA %
CDS         36784647    75794    5.471969458506481
UTR3        10597805    13639    0.9846714970125591
UTR5        7009625     19100    1.3789299503585217
ncRNA       222626      550      0.0397074069474967
intron      62954667    111156   8.024939139374442
intergenic  1503738722  1164893  84.0997825478005

What does this tell us about the Arabidopsis samples?

The premapadjusted.tsv file is the best one to look at because it incorporates the counts from the pre-mapping. It looks to me like the barcodes probably haven’t been properly accounted for in the Arabidopsis sample, because the overall crosslink count is very low, not even reaching 200,000. I would guess that most reads end up unmapped. The barcode information can be hard to find in a publication or GEO record. If you share the reference with me, I can help you find it. If you’re certain the barcodes are correct in your analysis, then I’m less sure what the cause is, but something odd is going on.
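
As a sanity check on the tables quoted in this thread, the premap-adjusted percentages appear to be each count divided by the sum of the genome-mapped counts plus the pre-mapped count. This is a minimal Python sketch of that arithmetic using the numbers posted above. The formula is inferred from those numbers rather than taken from the pipeline source, and the function name is illustrative.

```python
# cDNA counts copied from the tables posted in this thread.
arabidopsis = {"CDS": 1131, "UTR3": 417, "UTR5": 274,
               "ncRNA": 84, "intron": 896, "intergenic": 6578}
arabidopsis_premapped = 157809

solanum = {"CDS": 75794, "UTR3": 13639, "UTR5": 19100,
           "ncRNA": 550, "intron": 111156, "intergenic": 1164893}
solanum_premapped = 141471


def premap_adjusted(genome_counts, premapped_count):
    """Return (total cDNA count, percentage per type) including pre-mapped reads."""
    total = sum(genome_counts.values()) + premapped_count
    pct = {name: 100 * n / total for name, n in genome_counts.items()}
    pct["premapped rRNA_tRNA"] = 100 * premapped_count / total
    return total, pct


for label, counts, premapped in (
    ("Arabidopsis", arabidopsis, arabidopsis_premapped),
    ("Solanum", solanum, solanum_premapped),
):
    total, pct = premap_adjusted(counts, premapped)
    print(label, total, round(pct["premapped rRNA_tRNA"], 2))
```

With these inputs the Arabidopsis total is 167189 cDNAs, of which only 9380 map to the genome, against a Solanum total of 1526603. The absolute number of pre-mapped cDNAs is similar in the two species (157809 versus 141471), so the high Arabidopsis percentage mostly reflects how few reads reach the genome.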