josch/img2pdf

Having an issue with latest version and tif files. #48

Closed

opened 2021-04-25 19:58:11 +00:00 by josch · 0 comments

By George on 2018-08-21T21:13:34.013Z

So. I have code like this:

import os
import img2pdf

# image_directory, out_name, out_path and image_files are defined earlier in the script.
for root, dirs, files in os.walk(image_directory):
    for name in files:
        # Collect every file with a .tif or .TIF extension.
        if name.endswith((".tif", ".TIF")):
            path = os.path.join(root, name)
            print("Discovered this TIF: " + path)
            image_files.append(path)

    if image_files:
        output_file = str(out_name) + ".pdf"
        print("Putting all TIFs into " + output_file)
        print(image_files)
        pdf_bytes = img2pdf.convert(image_files)
        # Write the PDF and close the file handle when done.
        with open(out_path + output_file, "wb") as f:
            f.write(pdf_bytes)
    else:
        print("Couldn't find any TIFs")

Using IDLE with img2pdf, this code works just fine (I believe it is version 0.2.4 of img2pdf).

Today I tried running the same code in Anaconda, but with the latest version, which I think is 0.3.1.

Now when I run I get the following errors:

Traceback (most recent call last):

  File "ipython-input-4-b6fe1be62166", line 1, in <module>
    runfile('D:/Python Work/Stacking Tifs into PDFs/test of stacking tifs into pdfs.py', wdir='D:/Python Work/Stacking Tifs into PDFs')

  File "C:\Users\gsimler\AppData\Local\Continuum\anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\Users\gsimler\AppData\Local\Continuum\anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "D:/Python Work/Stacking Tifs into PDFs/test of stacking tifs into pdfs.py", line 47, in <module>
    pdf_bytes=img2pdf.convert(image_files)

  File "C:\Users\gsimler\AppData\Local\Continuum\anaconda2\lib\site-packages\img2pdf.py", line 1219, in convert
    rawdata, kwargs['colorspace'], kwargs['first_frame_only']):

  File "C:\Users\gsimler\AppData\Local\Continuum\anaconda2\lib\site-packages\img2pdf.py", line 840, in read_images
    offset, length = ccitt_payload_location_from_pil(imgdata)

  File "C:\Users\gsimler\AppData\Local\Continuum\anaconda2\lib\site-packages\img2pdf.py", line 709, in ccitt_payload_location_from_pil
    rows_per_strip = img.tag_v2[TiffImagePlugin.ROWSPERSTRIP]

  File "C:\Users\gsimler\AppData\Local\Continuum\anaconda2\lib\site-packages\PIL\TiffImagePlugin.py", line 496, in __getitem__
    data = self._tagdata[tag]

KeyError: 278

Any clue what is going on?

By josch on 2018-08-21T21:46:36.581Z

Could you assemble a minimal example that shows the problem you have and also provide the specific input image that causes the error to appear?

By George on 2018-08-21T22:08:57.892Z

Thanks for the quick response!

Sorry. I unfortunately cannot provide any of these documents as they are confidential legal documents. I can try later tonight to see if I can reproduce it using a regular tif and get back to you.

Not sure what you mean by a minimal example? The loop above is the exact code that generates the errors. In this instance I simply ran it through one directory with one tif file in it and got the long list of errors.

I did a quick diagnostic and the exact line causing the issue appears to be the "pdf_bytes=img2pdf.convert(image_files)" line. It does not get past it.
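When a whole directory is converted at once, one way to narrow things down to a minimal example is to try each input on its own and record which ones raise. The following is only a sketch: `find_failing_inputs` is a hypothetical helper, not part of img2pdf, and the converter is passed in as a parameter so the loop itself does not depend on any particular library.

```python
def find_failing_inputs(paths, convert):
    """Call convert(path) for each path and collect the ones that raise.

    Hypothetical helper for illustration; `convert` is any callable that
    takes a single path, for example img2pdf.convert.
    """
    failures = []
    for path in paths:
        try:
            convert(path)
        except Exception as exc:  # keep going so every bad input is reported
            failures.append((path, exc))
    return failures
```

With img2pdf installed, it could be called as `find_failing_inputs(image_files, img2pdf.convert)`; each returned pair holds the offending path and the exception it triggered.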

By josch on 2018-08-21T22:17:13.772Z

Depending on how these documents were created, you could create a document just like those but without any confidential info in it. For example, if they were created by a scanner, you could just scan a blank piece of paper.

Without being able to reproduce the problem myself, I'm afraid there is little I can do.

You are of course always free to fix the problem yourself and then send me the patch that solves it. I'm afraid it's only possible to fix these kinds of problems properly if one can reproduce them. And that is only possible when one has some data that triggers the problem.

By josch on 2018-08-23T07:15:09.752Z

Hi @Nukular -- any chance that you can obtain some test data for me? Otherwise I'm afraid I have to close this issue because there is nothing I can do about it without being able to look into the issue myself.

By George on 2018-08-25T15:27:17.486Z

Hi @josch,

I can't quite replicate this with any old Tif. And as I said I have no way to send you a sample file as they are all confidential legal documents provided to us by a client.

However I am able to use Photoshop to extract document info from a sample .tif that I know didn't work. I don't know if this is helpful or not. Please let me know.

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c140 79.160451, 2017/05/06-01:08:21        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/">
         <tiff:ImageWidth>2608</tiff:ImageWidth>
         <tiff:ImageLength>4352</tiff:ImageLength>
         <tiff:Compression>4</tiff:Compression>
         <tiff:PhotometricInterpretation>0</tiff:PhotometricInterpretation>
         <tiff:PlanarConfiguration>1</tiff:PlanarConfiguration>
         <tiff:XResolution>300/1</tiff:XResolution>
         <tiff:YResolution>300/1</tiff:YResolution>
         <xmp:CreateDate>2018-08-25T10:58:04-04:00</xmp:CreateDate>
         <xmp:ModifyDate>2017-06-15T14:24:34-04:00</xmp:ModifyDate>
         <xmp:MetadataDate>2017-06-15T14:24:34-04:00</xmp:MetadataDate>
         <dc:format>image/tiff</dc:format>
         <photoshop:ColorMode>0</photoshop:ColorMode>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

By josch on 2018-08-26T20:24:41.359Z

Hi, unfortunately there is nothing particularly interesting in this RDF info. The metadata looks completely normal. If you have ImageMagick installed, you could run identify -verbose on one of the TIFFs to get more information. Or if you have libtiff installed, running tiffdump or tiffinfo might be helpful.
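If neither tool is at hand, the tag numbers in a TIFF's first image file directory can also be listed with the Python standard library alone. This is a minimal sketch and not a replacement for tiffdump: `read_tiff_tags` is a hypothetical name, it only handles classic TIFF (not BigTIFF), reads only the first directory, and reports field type and count without decoding values.

```python
import struct


def read_tiff_tags(data):
    """Return {tag_number: (field_type, count)} for the first IFD.

    Hypothetical helper for illustration; `data` is the raw file content
    as bytes.
    """
    byte_order = data[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian
    elif byte_order == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file")

    # Header: 2-byte magic number (42) followed by the offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF")

    # The IFD starts with a 2-byte entry count; each entry is 12 bytes.
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack(endian + "HHI", data[start:start + 8])
        tags[tag] = (field_type, count)
    return tags
```

Reading a file with `open(path, "rb").read()` and checking `278 in read_tiff_tags(data)` would show whether RowsPerStrip is present.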

By George on 2018-08-26T20:51:33.542Z

Hi @josch

Did a little further digging in Photoshop. I tried to edit the image (deleting everything), but simply saving it in Photoshop made the image readable again. One difference I saw is that the ORIGINAL non-working image uses group4 compression, while the Photoshop version has no compression. Could that cause it?

Along with that, this file did work using version 0.2.4, but only if I use IDLE on my Windows machine (as 0.2.4 is a separate installation). My personal computer and Anaconda on my work machine both use the latest version.

I just saw your recent reply and will see what I can find.

By josch on 2018-08-26T20:59:48.605Z

No, img2pdf supports other group4-compressed TIFF images just fine. Such an image is part of its test suite.

Yes, img2pdf versions earlier than 0.3.0 should not have this bug because in those versions, TIFF images like yours were saved as uncompressed image data, resulting in very large files.

What we have to find out is what makes your tiff image that doesn't work different from those that do work.

With Imagemagick, you can create group4 compressed tiff images using this command:

convert input.tiff -compress group4 output.tiff

You will see that img2pdf is able to handle this kind of TIFF image just fine.

By George on 2018-08-27T02:27:50.319Z

Alright. Here is the output from tiffdump for the file that DOES work:

Check colorspace in photoshop (just saved).tif:
Magic: 0x4949 <little-endian> Version: 0x2a <ClassicTIFF>
Directory 0: offset 8 (0x8) next 0 (0)
SubFileType (254) LONG (4) 1<0>
ImageWidth (256) SHORT (3) 1<2608>
ImageLength (257) SHORT (3) 1<4352>
BitsPerSample (258) SHORT (3) 1<1>
Compression (259) SHORT (3) 1<1>
Photometric (262) SHORT (3) 1<0>
StripOffsets (273) LONG (4) 1<20092>
Orientation (274) SHORT (3) 1<1>
SamplesPerPixel (277) SHORT (3) 1<1>
RowsPerStrip (278) SHORT (3) 1<4352>
StripByteCounts (279) LONG (4) 1<1418752>
XResolution (282) RATIONAL (5) 1<300>
YResolution (283) RATIONAL (5) 1<300>
ResolutionUnit (296) SHORT (3) 1<2>
Software (305) ASCII (2) 36<Adobe Photoshop CC 2018  ...>
DateTime (306) ASCII (2) 20<2018:08:26 15:30:23\0>
700 (0x2bc) BYTE (1) 14239<0x3c 0x3f 0x78 0x70 0x61 0x63 0x6b 0x65 0x74 0x20 0x62 0x65 0x67 0x69 0x6e 0x3d 0x22 0xef 0xbb 0xbf 0x22 0x20 0x69 0x64 ...>
34377 (0x8649) BYTE (1) 5538<0x38 0x42 0x49 0x4d 0x4 0x25 00 00 00 00 00 0x10 00 00 00 00 00 00 00 00 00 00 00 00 ...>
34665 (0x8769) LONG (4) 1<1438844>

Here are the same results for the original version of the file that does NOT work with img2pdf:

Check colorspace in photoshop.tif:
Magic: 0x4949 <little-endian> Version: 0x2a <ClassicTIFF>
Directory 0: offset 8 (0x8) next 0 (0)
OldSubFileType (255) SHORT (3) 1<1>
ImageWidth (256) LONG (4) 1<2608>
ImageLength (257) LONG (4) 1<4352>
Compression (259) SHORT (3) 1<4>
Photometric (262) SHORT (3) 1<0>
FillOrder (266) SHORT (3) 1<1>
StripOffsets (273) LONG (4) 1<256>
StripByteCounts (279) LONG (4) 1<30159>
XResolution (282) RATIONAL (5) 1<300>
YResolution (283) RATIONAL (5) 1<300>
PlanarConfig (284) SHORT (3) 1<1>
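The two dumps can be compared mechanically by taking the set difference of their tag numbers. The sketch below simply copies the numbers from the two tiffdump listings above; the variable names are illustrative only.

```python
# Tag numbers copied from the tiffdump output of the file that works.
working_tags = {
    254, 256, 257, 258, 259, 262, 273, 274, 277, 278,
    279, 282, 283, 296, 305, 306, 700, 34377, 34665,
}
# Tag numbers copied from the tiffdump output of the file that fails.
failing_tags = {255, 256, 257, 259, 262, 266, 273, 279, 282, 283, 284}

only_in_working = sorted(working_tags - failing_tags)
only_in_failing = sorted(failing_tags - working_tags)

print("only in working file:", only_in_working)
print("only in failing file:", only_in_failing)
```

Tag 278 (RowsPerStrip), the key named in the KeyError from the traceback, is among the tags that only the working file carries.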

By josch on 2018-08-27T05:43:49.305Z

That's it! The RowsPerStrip property is missing. This property is essential for decoding group4 TIFF images. Currently, img2pdf requires it to be set, but the TIFF spec indeed says that if it's not set, the default value is 2**32 - 1, which basically means that the entire image is a single strip. Let me try to create a TIFF like that and I'll come back to you in a bit!
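The default described here can be expressed as a lookup with a fallback instead of a plain index. This is a sketch under stated assumptions, not the change that was eventually committed to img2pdf: `effective_rows_per_strip` is a hypothetical helper, and `tags` stands for any mapping from tag number to value (Pillow's `img.tag_v2` behaves like a mapping, so `.get` should work on it as well, though that is not verified here).

```python
ROWSPERSTRIP = 278  # TIFF tag number, the key in the KeyError above
TIFF_DEFAULT_ROWS_PER_STRIP = 2**32 - 1  # default named by the TIFF spec


def effective_rows_per_strip(tags, image_length):
    """Return the rows per strip, falling back to the spec default.

    Hypothetical helper for illustration. The result is capped at the
    image height, since a strip cannot hold more rows than the image has.
    """
    rows = tags.get(ROWSPERSTRIP, TIFF_DEFAULT_ROWS_PER_STRIP)
    return min(rows, image_length)
```

For the failing file above, which has no tag 278 and an ImageLength of 4352, this yields 4352: the whole image is treated as one strip.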

By josch on 2018-08-27T06:15:35.778Z

From the function ccitt_payload_location_from_pil could you remove the line that says:

rows_per_strip = img.tag_v2[TiffImagePlugin.ROWSPERSTRIP]

And see if that fixes things for you?

By George on 2018-08-27T14:22:45.499Z

That seems to work!

By josch on 2018-11-20T15:31:23.336Z

Status changed to closed by commit 42f8ac54a8662cc77bf54d40947b6a563a112157
