Prepare CLIP-Seq Genomes Error

Hello Flow Community,

This is my first time working with the Nextflow CLIP-seq pipeline, and I am running into an issue getting my organism’s genome to pass the Prepare CLIP-Seq Genome [1.1] module.

The issue seems to arise from the GTF file I have for my organism, as the following processes continuously fail:

CLIPSEQ_FIND_LONGEST_TRANSCRIPT

There are 0 protein coding transcripts   
These belong to 0 genes   
Traceback (most recent call last):   
  File "/media/storage/production/executions/445220002467081580/work/66/ae9eaafed5b7ff2a1f2a1509d16867/.command.sh", line 113, in <module>   
    main(args.process_name, args.gtf, args.output)   
  File "/media/storage/production/executions/445220002467081580/work/66/ae9eaafed5b7ff2a1f2a1509d16867/.command.sh", line 87, in main   
    gtf_output[-1] = gtf_output[-1].strip("\n")   
IndexError: list index out of range   

ICOUNT_SEG_GTF

Executing the following command: iCount segment Vibrio_fischeri_ES114_bracketsremoved.cmd.gtf Vibrio_fischeri_ES114_bracketsremoved_seg.gtf GCA_000011805.1_ASM1180v1_genomic.fasta.fai   
Input parameters for function 'get_segments' in iCount.genomes.segment   
    annotation: Vibrio_fischeri_ES114_bracketsremoved.cmd.gtf   
    segmentation: Vibrio_fischeri_ES114_bracketsremoved_seg.gtf   
    fai: GCA_000011805.1_ASM1180v1_genomic.fasta.fai   
    report_progress: False   
   
Traceback (most recent call last):   
  File "/usr/local/lib/python3.9/site-packages/iCount/cli.py", line 448, in main   
    result_object = func(**args)   
  File "/usr/local/lib/python3.9/site-packages/iCount/genomes/segment.py", line 1016, in get_segments   
    process_gene(gene_content)   
  File "/usr/local/lib/python3.9/site-packages/iCount/genomes/segment.py", line 1003, in process_gene   
    gene_content[id_] = _process_transcript_group(transcript_group)   
  File "/usr/local/lib/python3.9/site-packages/iCount/genomes/segment.py", line 755, in _process_transcript_group   
    assert exons   
AssertionError   
   
During handling of the above exception, another exception occurred:   
   
Traceback (most recent call last):   
  File "/usr/local/bin/iCount-Mini", line 10, in <module>   
    sys.exit(main())   
  File "/usr/local/lib/python3.9/site-packages/iCount/cli.py", line 456, in main   
    exception_message = exception.args[0]   
IndexError: tuple index out of range   

ICOUNT_SEG_FILTGTF

Reading annotation file.   
Number of entries in input annotation: 15788   
Checking for basic flag...   
Basic flag available.   
3 entries flagged as basic.   
Number of entries after filtering for tag "basic": 3988   
Traceback (most recent call last):   
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc   
    return self._engine.get_loc(casted_key)   
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc   
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc   
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item   
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item   
KeyError: 0   
   
The above exception was the direct cause of the following exception:   
   
Traceback (most recent call last):   
  File "/media/storage/production/executions/445220002467081580/work/71/a7bfa84df578d1e7f764d949538c00/.command.sh", line 97, in <module>   
    main(args.process_name, args.gtf, args.output)   
  File "/media/storage/production/executions/445220002467081580/work/71/a7bfa84df578d1e7f764d949538c00/.command.sh", line 69, in main   
    gene_ids = df_TSL["annotations"].str.split(";", n=1, expand=True)[0].unique().tolist()   
  File "/usr/local/lib/python3.10/site-packages/pandas/core/frame.py", line 3505, in __getitem__   
    indexer = self.columns.get_loc(key)   
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc   
    raise KeyError(key) from err   
KeyError: 0

I have reproduced a section from my GTF file below in case that is helpful:

#gtf-version 2.2
#!genome-build ASM1180v1
#!genome-build-accession NCBI_Assembly:GCA_000011805.1
CP000020.2	Genbank	gene	313	747	.	-	.	gene_id "VF_0001"; transcript_id ""; gbkey "Gene"; gene "mioC"; gene_biotype "protein_coding"; locus_tag "VF_0001"; old_locus_tag "VF0001"; 
CP000020.2	Genbank	CDS	316	747	.	-	0	gene_id "VF_0001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "mioC"; locus_tag "VF_0001"; product "FMN-binding protein MioC"; protein_id "AAW84496.1"; transl_table "11"; exon_number "1"; 
CP000020.2	Genbank	start_codon	745	747	.	-	0	gene_id "VF_0001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "mioC"; locus_tag "VF_0001"; product "FMN-binding protein MioC"; protein_id "AAW84496.1"; transl_table "11"; exon_number "1"; 

Best,

Jacob

Hi Jacob,

Wow what an interesting organism!!! Super cool you are doing CLIP.

The pipeline is optimised to work with Ensembl/GENCODE annotations, so it’s not unusual to have issues with annotations for non-model organisms. Essentially, you will want to edit your GTF to be as close to Ensembl-style as possible. Another common issue with non-standard GTFs is genes or exons falling off the ends of chromosomes, so that’s another thing to check for.
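As a rough starting point for those two checks, here is a minimal Python sketch (not part of the pipeline) that counts the feature types in a GTF and flags records whose coordinates fall outside the sequence lengths listed in a .fai index. The function names are hypothetical, and the demo rows are shortened from the excerpt in this thread with a placeholder length of 1000 rather than the real chromosome length. The excerpt shows gene, CDS and start_codon rows but no exon rows, which might be related to the `assert exons` failure, though that is only a guess.

```python
import collections
import io


def read_fai_lengths(fai_handle):
    """Map each sequence name to its length, read from a .fai index."""
    lengths = {}
    for line in fai_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def check_gtf(gtf_handle, lengths):
    """Count feature types and collect records that fall outside their sequence."""
    feature_counts = collections.Counter()
    out_of_bounds = []
    for line_number, line in enumerate(gtf_handle, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        seqname, feature = fields[0], fields[2]
        start, end = int(fields[3]), int(fields[4])
        feature_counts[feature] += 1
        seq_length = lengths.get(seqname)
        # Flag unknown sequences, coordinates past either end, or start > end.
        if seq_length is None or start < 1 or end > seq_length or start > end:
            out_of_bounds.append((line_number, seqname, feature, start, end))
    return feature_counts, out_of_bounds


# Demo on rows shortened from the GTF excerpt in this thread.
# The length 1000 is a placeholder, NOT the real length of CP000020.2.
demo_fai = io.StringIO("CP000020.2\t1000\t12\t80\t81\n")
demo_gtf = io.StringIO(
    "#gtf-version 2.2\n"
    'CP000020.2\tGenbank\tgene\t313\t747\t.\t-\t.\tgene_id "VF_0001"; transcript_id "";\n'
    'CP000020.2\tGenbank\tCDS\t316\t747\t.\t-\t0\tgene_id "VF_0001"; transcript_id "unassigned_transcript_1";\n'
    'CP000020.2\tGenbank\tstart_codon\t745\t747\t.\t-\t0\tgene_id "VF_0001"; transcript_id "unassigned_transcript_1";\n'
)
demo_counts, demo_problems = check_gtf(demo_gtf, read_fai_lengths(demo_fai))
print(dict(demo_counts))
print("exon rows:", demo_counts["exon"])
print("out-of-bounds rows:", demo_problems)
```

To run it on real files, pass `open("your.gtf")` and `open("your.fasta.fai")` in place of the `io.StringIO` objects. A feature type with a count of zero, or any out-of-bounds rows, would be the first things to look at when reshaping the annotation.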

Once you have an annotation that “plays ball”, we will need to add your species to our database; this is something you can’t currently do yourself. There are also a few settings I’d recommend for analysing bacterial data with the CLIP-Seq pipeline, as we did recently in this paper: https://www.embopress.org/doi/full/10.1038/s44320-024-00031-y#sec-4

Note the STAR settings to ensure no spliced alignments are made:
"Execution was run with default parameters except in the case of STAR, where the following arguments were detailed: --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMattributes All --alignIntronMin 1000000 --outFilterScoreMin 10 --alignEndsType Extend5pOfRead1"

You will also need a file with tRNA/rRNA sequences for your organism for the pre-mapping step. Please feel free to reach out again if you try some more things and are still banging your head; I can have a little go.
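If no ready-made tRNA/rRNA file exists for the organism, one possible way to build a draft is to pull those genes out of the genome using the annotation. The Python sketch below assumes gene rows carry an NCBI-style `gene_biotype "tRNA"` or `gene_biotype "rRNA"` attribute (the excerpt in this thread only shows `protein_coding`, so this needs checking against the real file). The function names and toy data are hypothetical, and the header format the pre-mapping step expects is not known here.

```python
import io

# Translation table used to reverse-complement minus-strand genes.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
RNA_BIOTYPES = {"tRNA", "rRNA"}


def read_fasta(handle):
    """Read a FASTA file into a dict of {sequence name: sequence}."""
    sequences = {}
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def get_attribute(attributes, key):
    """Return the quoted value of a GTF attribute, or None if it is absent."""
    marker = f'{key} "'
    if marker not in attributes:
        return None
    return attributes.split(marker, 1)[1].split('"', 1)[0]


def extract_rna_genes(gtf_handle, sequences):
    """Return (gene_id, sequence) pairs for gene rows annotated as tRNA or rRNA."""
    records = []
    for line in gtf_handle:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        if get_attribute(fields[8], "gene_biotype") not in RNA_BIOTYPES:
            continue
        seqname, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        if seqname not in sequences:
            continue  # annotation refers to a sequence missing from the FASTA
        sequence = sequences[seqname][start - 1:end]  # GTF is 1-based, inclusive
        if strand == "-":
            sequence = sequence.translate(COMPLEMENT)[::-1]
        gene_id = get_attribute(fields[8], "gene_id") or f"{seqname}:{start}-{end}"
        records.append((gene_id, sequence))
    return records


# Toy data for illustration only; none of it comes from a real genome.
toy_fasta = io.StringIO(">chrToy example\nAACCGG\nTTACGA\n")
toy_gtf = io.StringIO(
    'chrToy\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "toy_r1"; gene_biotype "rRNA";\n'
    'chrToy\tsrc\tgene\t5\t8\t.\t+\t.\tgene_id "toy_p1"; gene_biotype "protein_coding";\n'
    'chrToy\tsrc\tgene\t9\t12\t.\t-\t.\tgene_id "toy_t1"; gene_biotype "tRNA";\n'
)
toy_records = extract_rna_genes(toy_gtf, read_fasta(toy_fasta))
for gene_id, sequence in toy_records:
    print(f">{gene_id}\n{sequence}")
```

The printed records can be redirected to a file to get a FASTA draft. It is worth comparing the result against a curated source before relying on it, since annotations of tRNA/rRNA genes vary in quality.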

Best,
Charlotte

Hi Charlotte,

Thank you for the advice here; I was able to resolve the GTF error with my organism. For creating a database entry for Vibrio fischeri, what are the next steps?

Best,

Jacob

Hi Jacob

That’s marvellous! I will have to do that for you. Please email me your GTF, FASTA and tRNA/rRNA FASTA for pre-mapping, ideally with some details of where those sequences are from, and I’ll get it set up for you (charlotte.capitanchik@goodwright.com).

Cheers!
Charlotte