Demultiplexing error - Ultraplex

Hi,

I am trying to demultiplex a paired-end fastq file
(Execution - Flow).
The Ultraplex step has been running for 10 hours with no output, and there is an error:

“File “/usr/local/lib/python3.9/site-packages/ultraplex/main.py”, line 696, in run
raise e
gzip.BadGzipFile: CRC check failed 3751927365 != 485080615”

@Charlotte
This is similar to an error I was getting when demultiplexing a “CAP.fastq” file, which turned out to be corrupt. But this time it is a completely different (paired-end) fastq file.

I believe I uploaded it as instructed by Sam Ireland:
1- uploading the R1 file (I confirmed the upload reached 100%, up to the point where the “view file” message appeared);
2- uploading the R2 file and linking it to the R1 file (I confirmed the upload reached 100%).

I do not understand how this file might be corrupted as well.

Thank you.
p

Hi Paulo

How did you originally get the fastq file? I think the corruption is happening when you download these files, rather than when you upload them to Flow, since you say the upload completes without error.

If possible, you can check that your download of a file completed without error by comparing its “hash” on the server you downloaded it from with the “hash” calculated on your laptop. An easy way is to use the md5sum command to produce such a hash.
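
If md5sum isn’t available on your machine, here is a minimal Python sketch that computes the same hash, reading in chunks so large fastq.gz files don’t need to fit in memory (the filename is just an example):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare this output on your laptop vs. on the server you downloaded from.
print(md5_of_file("CLIP-seq_R1_001.fastq.gz"))
```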

Cheers
Charlotte

Hi Charlotte,

I have downloaded the fastq files again, and checked that the checksums match those of the original files on the server I downloaded them from.

original R1 file:
7e771bfadb5dd5ac2813898965a5c244 CLIP-seq_R1_001.fastq.gz
my download:
7e771bfadb5dd5ac2813898965a5c244 CLIP-seq_R1_001.fastq.gz

original R2 file:
a71d6e85c83880fe5c26c229f5bc78f9 CLIP-seq_R2_001.fastq.gz
my download:
a71d6e85c83880fe5c26c229f5bc78f9 CLIP-seq_R2_001.fastq.gz

I then deleted the supposedly corrupted fastq files from Flow and uploaded these new ones. I started a demultiplex execution,
https://app.flow.bio/executions/846560256457364344
and Ultraplex fails again (or, if it has not failed outright, it shows the same error).

Thank you for replying so quickly; I wanted to try your suggestion before responding. I do not know what could possibly be causing this Ultraplex error.

"
File “/usr/local/lib/python3.9/site-packages/ultraplex/main.py”, line 696, in run
raise e
gzip.BadGzipFile: CRC check failed 2645973715 != 464354527"

Thank you.
Paulo

Hi Paulo

Thanks for this. The error is definitely caused by the fastq files being corrupted in some way. I will let @sam investigate this further for you. Perhaps Sam can get the hash of the files currently on the Flow server; if it is different, that would tell us that the upload was probably interrupted, causing the corruption. Do you have a good, stable internet connection when making the upload (i.e. via an Ethernet cable)?

Cheers
Charlotte

Hi

Another thought: the corruption of the original file I looked into for you happened when it was archived and unarchived on the Crick server. Can I just check the origin of this fastq file: was it also archived and unarchived on the Crick server in a similar way?

Cheers
Charlotte

Hi Charlotte,

I do not have a stable internet connection via Ethernet cable; the files were uploaded via WiFi. Are there reports of demultiplexing failures caused by files corrupted during WiFi uploads?

Re the previous corrupted fastq file (the CAP.fastq, right?): I got it from the CAMP (later Nemo) server at the Crick. It was archived, and I believe I downloaded and uploaded it as an archived file (fastq.gz).

Thanks for investigating this further! Please tell me if there is anything I can try on my end to work around this issue.

Thank you.
Paulo

Hi

When uploading and downloading files, a stable internet connection is important to prevent corruption. The CAMP archive is something different from just zipping or unzipping files: the CAP file was in a special CAMP archive, and it became corrupted through a poor transition out of it. When I contacted HPC and had it unarchived again and uploaded for you, everything was fine. Did the other CLIP files that you now have a problem with go through this process?

Hi Charlotte,

Thanks for clarifying using the CAP file example, I now understand what you mean.

The other CLIP files were downloaded via FTP from a sequencing facility server, not CAMP. I imagine these did not undergo an archive/unarchive cycle.

(The sequencing facility provided checksum files along with the fastq files, which allowed me to verify the downloaded files on my local drive.)
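
In case it’s useful, that check can be scripted; here is a minimal sketch, assuming the facility’s checksum file uses the standard md5sum format (hash, whitespace, filename, one pair per line) and sits in the same directory as the fastq files (the manifest filename below is hypothetical):

```python
import hashlib

def verify_md5_manifest(manifest_path):
    """Recompute the MD5 of each file listed in an md5sum-style manifest."""
    with open(manifest_path) as manifest:
        for line in manifest:
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            name = name.strip().lstrip("*")  # md5sum prefixes '*' in binary mode
            digest = hashlib.md5()
            with open(name, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            ok = digest.hexdigest() == expected
            print(f"{name}: {'OK' if ok else 'FAILED'}")

verify_md5_manifest("checksums.md5")  # hypothetical manifest filename
```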

Thanks for being on top of this really silly issue. I do not know how to sort it out, beyond trying to download and upload the files over a cable connection, which I cannot do at the moment.

Thank you.
p


Hi Paulo,

I do get different MD5 hashes for the two files on the Flow server. Specifically:

CLIP-seq_R2_001.fastq.gz 6f9b43e4fdc07e5a5dc624d9c85ea3b2
CLIP-seq_R1_001.fastq.gz 564228394fa424cd4c20ac93a95dcca8

So these files do appear to be different from the originals, though I’m not immediately sure how. Could you confirm their exact size in bytes? On the Flow server I have:

CLIP-seq_R2_001.fastq.gz 23,634,404,503 bytes
CLIP-seq_R1_001.fastq.gz 21,052,084,277 bytes
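
If helpful, here is a quick sketch for printing the exact byte sizes on your side (filenames as above):

```python
import os

# Print exact sizes in bytes for comparison with the values above.
for name in ("CLIP-seq_R1_001.fastq.gz", "CLIP-seq_R2_001.fastq.gz"):
    print(f"{name}: {os.path.getsize(name):,} bytes")
```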

Sam

Hi Paulo

Just to be 100% sure about everything: here you mentioned that you re-downloaded the files from the server to get the md5 hash, so the hash you gave us is not necessarily the hash of the exact file you uploaded to Flow. Can you please upload these exact files to Flow, so we can be 100% sure about the hashes? Otherwise, it could be that the first time you downloaded from the server the file got corrupted, and you uploaded that corrupted copy to Flow. If you re-upload the exact file with the hash you gave us, then we can be sure that any difference on Flow is due to something that happened during upload.

This might seem pedantic, but it’s important to be thorough. It would be very, very unusual for the corruption to happen during upload, because there is extensive checking in place during upload to ensure file integrity.

Cheers
Charlotte

Hi Sam and Charlotte,

I confirm that the sizes in bytes are exactly the same.

On the server, you see:
CLIP-seq_R1_001.fastq.gz 21,052,084,277 bytes
CLIP-seq_R2_001.fastq.gz 23,634,404,503 bytes

On my local computer, I see:
CLIP-seq_R1_001.fastq.gz 21,052,084,277 bytes
CLIP-seq_R2_001.fastq.gz 23,634,404,503 bytes

Charlotte:
“If you re-upload this exact file with the hash you gave us then we can be sure that if there is a difference on Flow it is because of something that happened during upload.” This is what I have done.

1st, I re-downloaded the files from the server;
2nd, I checked the MD5 hashes of the downloaded files locally on my computer;
3rd, I checked that the MD5 hashes of the files on the server are the same;
4th, only after that, I uploaded the re-downloaded files to Flow.
So the MD5 hashes I showed here should be from the re-downloaded files, but I will check again.

Thank you for assisting with this issue. I was out of work for two days due to a personal situation, hence my slow response.

Kind regards,
p

" 4th. Only after, I uploaded the re-downloaded files to Flow ."

And are you 100% sure that using this file you also get the Ultraplex error?

Perhaps another piece of information would be useful: do you get any error if you try to gunzip the fastq.gz on your laptop?
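
The equivalent of gunzip -t, if you’d rather not write the decompressed file to disk, is to stream through the archive; here is a minimal Python sketch that triggers the same CRC check Ultraplex is failing on:

```python
import gzip

def gzip_is_intact(path, chunk_size=1 << 20):
    """Read a .gz file to the end; gzip verifies the stored CRC-32 there."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (gzip.BadGzipFile, EOFError, OSError) as error:
        print(f"{path}: {error}")
        return False

print(gzip_is_intact("CLIP-seq_R1_001.fastq.gz"))
```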

Hi Charlotte,
I will re-do these steps now to be 100% sure.

I am checking if the gunzip of the fastq.gz gives errors.

Thanks,
p

Hi Charlotte, Sam,

A way around this issue would be to bypass me downloading the files from the server: the NGS facility could upload the files directly to Flow themselves, but this would have to be via the command line, the SFTP protocol, or similar. Can we do this?

I see no errors when I gunzip the fastq files.
I have not yet had the chance to repeat the download and upload process.

Thank you,
Paulo

Hi, they can do that using the flowbio Python library: Uploading with flowbio - Protocol API Reference
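
Something along these lines, though the exact method and parameter names here are my assumption rather than confirmed flowbio API; the linked Protocol API Reference has the authoritative usage:

```python
# Sketch only: method and parameter names are assumptions, not confirmed
# flowbio API; see the Protocol API Reference linked above.
import flowbio

client = flowbio.Client()
client.login("facility-username", "facility-password")  # facility's Flow credentials
client.upload_sample(
    "CLIP-seq sample",               # sample name to show in Flow
    "CLIP-seq_R1_001.fastq.gz",      # R1 reads
    "CLIP-seq_R2_001.fastq.gz",      # R2 reads, linked to R1
    progress=True,                   # show upload progress
)
```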

But just to check: you gave us the hash of the file that you uploaded, and when you ran demultiplex on that uploaded file, it failed? I’m not sure there’s a need for you to repeat the download and upload; the hashes just need to be checked at every step so we can trace where the issue occurs.

Hi Charlotte,

Thank you, I will forward this Protocol API reference.

I see, right. I just wanted to be 100% sure that the file I uploaded to Flow was the same as the one whose hash I gave you, but I can just check the hash of that uploaded file on Flow; I am downloading it now.

I understand this is what Sam already confirmed, i.e. that the hash of the uploaded file on Flow is different, so I should see the same thing. If that checks out, I’d conclude that the corruption happened during my upload to Flow.

Just to confirm, at the moment I have:
local R1 file (today):
7e771bfadb5dd5ac2813898965a5c244 CLIP-seq_R1_001.fastq.gz
R1 file on server:
7e771bfadb5dd5ac2813898965a5c244 CLIP-seq_R1_001.fastq.gz

local file R2 (today):
a71d6e85c83880fe5c26c229f5bc78f9 CLIP-seq_R2_001.fastq.gz
R2 file on server:
a71d6e85c83880fe5c26c229f5bc78f9 CLIP-seq_R2_001.fastq.gz

Thank you,
Paulo

“I just wanted to be 100% sure that the file I upload to Flow was the same as that of the hash I gave you.” This is important: if you are not 100% sure, then it’s not meaningful that the hash of the file on Flow is different. (There is no need for you to download again to check the hash; as you say, Sam can access this, and downloading introduces an extra variable.)

So let us know when you are 100% sure; then Sam can check the server hash, and we can conclude whether this corruption happened on upload.

Hi Charlotte,

I understand; this is why I wanted to upload the files to Flow (again) before saying I am 100% sure. I am doing that now.

At any rate, the hash of the Flow-downloaded file is:
CLIP-seq_R1_001.fastq.gz 564228394fa424cd4c20ac93a95dcca8
as Sam reported. This should confirm that the corruption happened during upload, unless I did not actually upload the re-downloaded files.

Just to look at dates:
-The 1st upload to Flow was on Apr 30th.
-I deleted the 2 fastq files after noticing the Ultraplex error and discussing it here.
-I re-downloaded the files and uploaded them to Flow on May 3rd (current files).

At any rate, please bear in mind that we’re assuming the first downloaded files that I uploaded to Flow were corrupt in some way, when there is no direct evidence for that. What we know is that:
-the hashes of the files on my computer and on the FTP server are the same,
-the hashes on the Flow server are different.

Thanks for your patience.
Paulo


Hi,

I have re-uploaded the files using an Ethernet cable and obtained the same ‘bad gzip’ Ultraplex error, execution: Flow

I do not know how the file corruption could happen during the Flow upload, since I used a cable this time around. Is there anything else I can check on my end?

Thank you.
p