I’m running a CLIP-seq pipeline on Arabidopsis with crosslinked samples (R1 and R2), a non-crosslinked sample (noUV_R1) and a control sample in which an unspecific IgG antibody was used instead of the specific antibody (IgG_R1).
Firstly, I prepared the genome for the CLIP pipeline. I uploaded the genome .fasta sequence and .gtf annotation file and the ncRNA .fasta sequence:
In the .gtf file, I had to remove every attribute in column 9 other than gene_id, transcript_id, exon_number, gene_biotype and transcript_biotype in order to avoid an error in iCount:
"[ValueError] need more than 1 value to unpack
in the context of pybedtools.cbedtools.Attributes.__init__ suggests that a malformed attribute field is being encountered in your GTF file.",
according to:
I did the same with the Solanum tuberosum genome.
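For anyone hitting the same error, the attribute filtering described above can be sketched roughly as follows. This is only a minimal sketch, not the exact commands used here: the function names are made up for illustration, and it assumes column 9 holds standard `key "value";` pairs.

```python
import re

# Attributes kept in column 9, as listed in the post above.
KEEP = ("gene_id", "transcript_id", "exon_number",
        "gene_biotype", "transcript_biotype")

# Matches `key "value";` or `key value;` pairs in a GTF attribute field.
# Quoted values may contain semicolons or spaces.
ATTR_RE = re.compile(r'\s*(\S+)\s+("[^"]*"|[^;\s]+)\s*;?')


def filter_attributes(line, keep=KEEP):
    """Return a GTF line whose column 9 contains only the attributes in `keep`."""
    if line.startswith("#") or not line.strip():
        return line  # comments and blank lines pass through unchanged
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        return line  # not a 9-column record: leave it untouched
    kept = [f"{key} {value};"
            for key, value in ATTR_RE.findall(fields[8])
            if key in keep]
    fields[8] = " ".join(kept)
    return "\t".join(fields) + "\n"


def filter_gtf(in_path, out_path):
    """Write a copy of the GTF at `in_path` with filtered attributes (hypothetical helper)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(filter_attributes(line))
```

Calling `filter_gtf("input.gtf", "filtered.gtf")` (both file names are placeholders) writes the cleaned annotation. Comparing a few records before and after is a cheap way to confirm nothing else changed.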
However, when running the CLIP-seq pipeline, the Solanum samples gave an adequate number of unique reads (> 1M), whereas the Arabidopsis samples did not. Their total number of reads didn’t even reach 1M. Furthermore, the subsequent CLIP-seq analyses in the Arabidopsis pipeline did not provide results for one of the crosslinked samples. For example, the R1 sample didn’t yield any thresholded sites.
The antibody used was designed against a Solanum betaceum peptide synthesised using our cDNA sequence of interest. The antibody was tested in Western-blot and immunoprecipitation analyses with protein extracts from both species. Some unspecific bands were observed, although the pattern was similar in both species.
So, the question is: to understand the low unique read count in the Arabidopsis results, do you think that the Arabidopsis genome preparation, or even the fasta or gtf files, could contain mistakes or features that subsequently interfere with this count, or can this be explained by a possible low specificity of the antibody? Or is there another explanation?
Is there any procedure to verify the quality of the genome preparation and/or the uploaded genome files? If so, how can I overcome a low-quality genome preparation or low-quality files?
Sounds like a very cool project! You could try analysing a published dataset to see whether you can reproduce those results. That would rule out any problem with the bioinformatics. I know there are some Arabidopsis iCLIP samples here: GEO Accession viewer. That being said, the only time I’ve ever had a problem with something like that was when I had some weird line endings in my fasta, because it was a custom genome produced from a file saved in Windows at some point. If you have any expected targets or binding sites, you can check those in the genome browser. You can also load your fasta and gtf in IGV or something similar to check that they make sense.
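If it helps, here is a rough sketch of how the line-ending check could be scripted. It counts Windows-style (CRLF) line endings in the fasta. As an extra check that is my own assumption rather than something diagnosed in this thread, it also compares the sequence names in the fasta with column 1 of the gtf, since a mismatch there (for example `Chr1` versus `1`) would also break annotation-based steps. The function names are made up for illustration.

```python
def fasta_names(path):
    """Return (set of sequence names, number of lines ending in CRLF)."""
    names, crlf = set(), 0
    # Binary mode, so that '\r\n' is not silently translated to '\n'.
    with open(path, "rb") as fh:
        for raw in fh:
            if raw.endswith(b"\r\n"):
                crlf += 1
            if raw.startswith(b">"):
                header = raw[1:].decode().split()
                if header:
                    names.add(header[0])  # name = text up to first whitespace
    return names, crlf


def gtf_seqnames(path):
    """Return the set of sequence names used in column 1 of a GTF file."""
    names = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def check_genome(fasta_path, gtf_path):
    """Summarise CRLF line endings and name mismatches between fasta and gtf."""
    fa, crlf = fasta_names(fasta_path)
    gtf = gtf_seqnames(gtf_path)
    return {
        "crlf_lines": crlf,
        "only_in_gtf": sorted(gtf - fa),
        "only_in_fasta": sorted(fa - gtf),
    }
```

A clean pair of files should give `crlf_lines` equal to 0 and an empty `only_in_gtf` list. Names that appear only in the fasta (unplaced scaffolds, for instance) are not necessarily a problem.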
One thing is that your Arabidopsis samples maybe have more “unmapped too short” reads. Could the RNase treatment have been too strong? You can check the trimmed read lengths and the original gels from optimising the RNase treatment; this can be a common problem in CLIP.
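To check the trimmed read lengths quickly, something along these lines should work on a trimmed FASTQ file. It is only a sketch: the function names are made up, it assumes standard four-line FASTQ records, and the length cutoff you pass in is an arbitrary illustration rather than a threshold used by the pipeline.

```python
import gzip
from collections import Counter


def read_length_counts(fastq_path):
    """Count reads per sequence length in a (optionally gzipped) FASTQ file."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # the second line of each record is the sequence
                counts[len(line.strip())] += 1
    return counts


def fraction_shorter_than(counts, min_len):
    """Fraction of reads whose length is below `min_len` (0.0 if no reads)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    short = sum(n for length, n in counts.items() if length < min_len)
    return short / total
```

Comparing the length distribution of the Arabidopsis samples with the Solanum ones, which behaved well, should show whether the Arabidopsis inserts are systematically shorter.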
Another suggestion is to look at the pre-mapping to tRNA and rRNA with Bowtie - do you lose a lot of reads here? If the pipeline got there, you can find a summary in the output of the MERGED_SUMMARY process that includes this pre-mapping as a line.