I’m running a CLIP-seq pipeline on Arabidopsis samples: two crosslinked replicates (R1 and R2), a non-crosslinked control (noUV_R1), and a control in which a non-specific IgG antibody was used instead of the specific antibody (IgG_R1).
First, I prepared the genome for the CLIP pipeline by uploading the genome .fasta sequence, the .gtf annotation file and the ncRNA .fasta sequence.
In the .gtf file, I had to remove every attribute in column 9 other than gene_id, transcript_id, exon_number, gene_biotype and transcript_biotype in order to avoid the following error in ICOUNT:
"[ValueError] need more than 1 value to unpack
in the context of pybedtools.cbedtools.Attributes.init suggests that a malformed attribute field is being encountered in your GTF file.",
according to:
I did the same with the Solanum tuberosum genome.
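For anyone wanting to reproduce the attribute clean-up described above, here is a minimal sketch in Python. It assumes a standard 9-column, tab-separated GTF; the function names `filter_gtf_line` and `filter_gtf` are hypothetical and are not part of iCount or the pipeline.

```python
import re

# Attribute keys kept in column 9 (the list given in the post above).
KEEP = ("gene_id", "transcript_id", "exon_number",
        "gene_biotype", "transcript_biotype")

# One attribute: a key, whitespace, then a quoted or unquoted value.
ATTR_RE = re.compile(r'([^\s;"]+)\s+("[^"]*"|[^;\s]+)')


def filter_gtf_line(line, keep=KEEP):
    """Return the GTF line with only the wanted attributes in column 9."""
    if line.startswith("#") or not line.strip():
        return line  # comments and blank lines pass through unchanged
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        return line  # not a standard 9-column record; leave it alone
    pairs = ATTR_RE.findall(fields[8])
    kept = [f"{key} {value};" for key, value in pairs if key in keep]
    fields[8] = " ".join(kept)
    return "\t".join(fields) + "\n"


def filter_gtf(in_path, out_path, keep=KEEP):
    """Write a copy of in_path with column 9 reduced to the keys in keep."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(filter_gtf_line(line, keep))
```

Calling `filter_gtf("in.gtf", "out.gtf")` (both paths are placeholders) writes the reduced annotation. Attributes that do not parse as a key followed by a value are dropped, which also removes the kind of malformed field the error message complains about.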
However, when I ran the CLIP-seq pipeline, the Solanum samples gave an adequate number of unique reads (> 1M), but the Arabidopsis samples did not: their total number of reads didn’t even reach 1M. Furthermore, the downstream CLIP-seq analyses in the Arabidopsis run did not produce results for one of the crosslinked samples. For example, the R1 sample gave no thresholded sites.
The antibody used was designed against a Solanum betaceum peptide synthesised using our cDNA sequence of interest. The antibody was tested in Western-blot and immunoprecipitation analyses with protein extracts from both species. Some unspecific bands were observed, although the pattern was similar in both species.
So, the question is: to understand the low unique read count in the Arabidopsis results, do you think that the Arabidopsis genome preparation, or even the fasta or gtf files, could contain mistakes or features that subsequently interfere with this count? Or could it be explained by low specificity of the antibody? Or is there another explanation?
Is there any procedure to verify the quality of the genome preparation and/or the uploaded genome files? If so, how can I overcome a low-quality genome preparation or low-quality files?
Sounds like a very cool project! You could try analysing a published dataset to see whether you can reproduce its results. That would rule out any problem with the bioinformatics. I know there are some Arabidopsis iCLIP samples here: GEO Accession viewer. That being said, the only time I’ve ever had a problem like that was when I had some weird line endings in my fasta, because it was a custom genome produced from a file that had been saved in Windows at some point. If you have any expected targets or binding sites, you can check those in the genome browser. You can also load your fasta and gtf in IGV or something similar to check that they make sense.
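A quick way to check for the Windows line-ending problem mentioned above is to count the line endings in the fasta (or gtf) directly. This is only a sketch; `line_ending_report` is a hypothetical helper name, and the file is read in binary mode so that Python does not silently translate the newlines.

```python
def line_ending_report(path):
    """Count CRLF (Windows), LF (Unix) and stray CR characters in a file."""
    counts = {"crlf": 0, "lf": 0, "bare_cr": 0}
    with open(path, "rb") as handle:  # binary mode: no newline translation
        for line in handle:  # binary iteration splits on b"\n" only
            if line.endswith(b"\r\n"):
                counts["crlf"] += 1
                body = line[:-2]
            elif line.endswith(b"\n"):
                counts["lf"] += 1
                body = line[:-1]
            else:
                body = line  # final line without a trailing newline
            # Any remaining CR is a carriage return inside the line.
            counts["bare_cr"] += body.count(b"\r")
    return counts
```

For a file with clean Unix line endings, `crlf` and `bare_cr` should both be 0. Non-zero values suggest the file passed through Windows (or an old Mac editor) and may need converting before it is used to build the indices.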
One thing is that your Arabidopsis samples seem to have more “unmapped too short” reads. Could the RNase treatment have been too strong? You can check the trimmed read lengths and the original gels from optimising the RNase treatment; this can be a common problem in CLIP.
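To look at the trimmed read lengths, a short script along these lines can tabulate them from a FASTQ file. It is a sketch that assumes standard 4-line FASTQ records (optionally gzipped); the function names and the length cutoff are illustrative choices, not values taken from the pipeline.

```python
import gzip
from collections import Counter


def read_length_counts(fastq_path):
    """Return a Counter mapping read length -> number of reads."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    lengths = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                lengths[len(line.strip())] += 1
    return lengths


def fraction_shorter_than(lengths, min_len):
    """Fraction of reads whose length is below min_len."""
    total = sum(lengths.values())
    short = sum(n for length, n in lengths.items() if length < min_len)
    return short / total if total else 0.0
```

Comparing `fraction_shorter_than(lengths, 16)` between the Arabidopsis and Solanum trimmed files (16 is an arbitrary example cutoff) would show whether the Arabidopsis inserts are systematically shorter, which is what over-digestion would be expected to produce.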
Another suggestion is to look at the pre-mapping to tRNA and rRNA with Bowtie - do you lose a lot of reads here? If the pipeline got there, you can find a summary in the output of the MERGED_SUMMARY process that includes this pre-mapping as a line.
I am still trying the analysis with published data.
However, I followed your advice about the reads pre-mapped to tRNA and rRNA, and I saw that while in the Solanum samples I lost only 9.27% of the cDNA (141471 in cDNA#), in Arabidopsis I lost 94.4% of the cDNA (157809 in cDNA#), which is not present in the respective ICOUNT_SUMMARY. What do you think causes this difference, and what can I do to overcome the problem?
Sorry, I don’t think I understand. When you look at the MERGED_SUMMARY output (not ICOUNT_SUMMARY) you will have a line for premapped ncRNA. Are you saying that for your Arabidopsis analysis, nearly 100% of the RNA premaps to tRNA and rRNA?
The premapadjusted.tsv is the best one to use because it incorporates the counts from the pre-mapping. It looks to me like the barcodes probably haven’t been properly accounted for in the Arabidopsis sample, because the overall crosslink count is super low, not even reaching 200,000. I would guess that most reads end up unmapped. The barcode information can be hard to find in a publication or GEO record. If you share the reference with me, I can help you find it. If you’re certain the barcodes are correct in your analysis, then I’m not so sure what the cause is, but something weird is going on.
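When the barcode layout is not stated in the publication, one rough way to look for it is to check which of the first few read positions are almost always the same base: fixed experimental barcode positions tend to be dominated by one base, while random (UMI) positions are mixed. The sketch below assumes the barcode sits at the 5′ end of the raw read; `dominant_positions` and its thresholds are hypothetical and only illustrate the idea.

```python
from collections import Counter


def dominant_positions(reads, n_positions=12, min_fraction=0.9):
    """Return (position, base) for leading positions dominated by one base.

    reads is any iterable of sequence strings; positions are 0-based.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1
    result = []
    for i, counter in enumerate(counts):
        total = sum(counter.values())
        if total == 0:
            continue  # no read was long enough to cover this position
        base, n = counter.most_common(1)[0]
        if n / total >= min_fraction:
            result.append((i, base))
    return result
```

Run on a sample of raw (untrimmed, not yet demultiplexed) reads, a contiguous run of dominated positions flanked by mixed ones would hint at where the fixed barcode sits. The result is only a hint, and should be checked against the library protocol before changing the pipeline settings.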