Embedded fonts with Subset are not marked as Subset in PDF

rvk · 2022-04-28 14:23:32

I'm trying to see why SynPdf files don't show embedded subset fonts as Subset in Adobe reader.
Looking at pdffonts from xpdf-tools, you can see that for Ghostscript (Test0.pdf) the fonts are embedded, but also subsetted and Unicode.

According to the official PDF specs, the names of embedded subset fonts need to be preceded by a tag of 6 arbitrary uppercase letters followed by a + sign.
The fonts in Test1.pdf from SynPdf are not marked as subset (and are also not embedded as Unicode).

5.5.3 Font Subsets
PDF 1.1 permits documents to include subsets of Type 1 and TrueType fonts. The font and font descriptor that describe a font subset are slightly different from those of ordinary fonts. These differences allow an application to recognize font subsets and to merge documents containing different subsets of the same font. (For more information on font descriptors, see Section 5.7, “Font Descriptors.”) For a font subset, the PostScript name of the font —the value of the font’s BaseFont entry and the font descriptor’s FontName entry— begins with a tag followed by a plus sign (+). The tag consists of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file must have different tags. For example, EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font. (See implementation note 63 in Appendix H.)

https://ghostscript.com/~robin/pdf_reference17.pdf
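To make the naming rule concrete, here is a minimal sketch of a check for the tag form quoted above (six uppercase letters followed by a plus sign). HasSubsetTag is a hypothetical helper written for this post, not part of SynPdf or xpdf, and whether pdffonts applies exactly this test is an assumption. It is written for Free Pascal in Delphi mode and should also build as a Delphi console program.

```pascal
program SubsetTagCheck;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: returns True when Name starts with six uppercase
// ASCII letters followed by '+', the tag form described in section 5.5.3.
function HasSubsetTag(const Name: AnsiString): Boolean;
var
  i: Integer;
begin
  Result := False;
  // needs 6 letters, the '+', and at least one character of font name
  if (Length(Name) < 8) or (Name[7] <> '+') then
    Exit;
  for i := 1 to 6 do
    if not (Name[i] in ['A'..'Z']) then
      Exit;
  Result := True;
end;

begin
  WriteLn(HasSubsetTag('RDZRPI+Code128')); // tagged name, as Ghostscript writes it
  WriteLn(HasSubsetTag('Code128'));        // untagged name, as SynPdf writes it
end.
```

The two names in the main block are taken from the pdffonts listings in this post.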

This one is from ghostscript:

S:\pdfs\xpdf-tools-win-4.02\bin32>pdffonts -loc c:\temp\Test0.pdf
name                                           type              emb sub uni prob object ID location
---------------------------------------------- ----------------- --- --- --- ---- --------- --------
RDZRPI+Code128                                 TrueType          yes yes yes          12  0 embedded
UFQSLH+KIXBarcode                              TrueType          yes yes yes          14  0 embedded
ZRSKVS+SegoeScript                             TrueType          yes yes yes          16  0 embedded
UFQSLH+Tahoma                                  TrueType          yes yes yes           8  0 embedded
RDZRPI+Code3de9                                TrueType          yes yes yes          10  0 embedded

This one is from SynPdf (note the "no" in the sub column, and in Adobe reader there is also no Subset keyword):

S:\pdfs\xpdf-tools-win-4.02\bin32>pdffonts -loc c:\temp\Test1.pdf
name                                           type              emb sub uni prob object ID location
---------------------------------------------- ----------------- --- --- --- ---- --------- --------
Tahoma                                         TrueType          no  no  no            6  0 external: C:\WINDOWS\Fonts\tahoma.ttf
Code3de9                                       TrueType          yes no  no            8  0 embedded
Code128                                        TrueType          yes no  no           10  0 embedded
KIXBarcode                                     TrueType          yes no  no           12  0 embedded
SegoeScript                                    TrueType          yes no  no           14  0 embedded

So I hacked the code a little to add the random characters. Of course the tags should not collide with those of other fonts, which I haven't implemented, but with this code the fonts do show as Subset.

var
    Prefix: AnsiString;
//...
if CreateFontPackage(pointer(ttf),ttfSize,
    SubSetData,SubSetMem,SubSetSize,
    usFlags,ttcIndex,TTFMFP_SUBSET,0,
    TTFCFP_MS_PLATFORMID,TTFCFP_DONT_CARE,
    pointer(Used.Values),Used.Count,
    @lpfnAllocate,@lpfnReAllocate,@lpfnFree,nil)=0 then begin
  // subset was created successfully -> save to PDF file
  SetString(ttf,SubSetData,SubSetSize);
  FreeMem(SubSetData);

  // CleanUpSubsetTTFTables(TTF); // working on this, see future topic

  //---------
  Prefix := '';
  if System.RandSeed = 0 then Randomize; // only call when needed
  for i := 1 to 6 do Prefix := Prefix + Chr(65 + Random(26));
  Prefix := Prefix + '+';
  if fFontDescriptor.ValueByName('FontName') <> nil then
    TPdfName(fFontDescriptor.ValueByName('FontName')).Value := Prefix + TPdfName(fFontDescriptor.ValueByName('FontName')).Value;
  if Data.ValueByName('BaseFont') <> nil then
    TPdfName(Data.ValueByName('BaseFont')).Value := Prefix + TPdfName(Data.ValueByName('BaseFont')).Value;
  //---------

end;

Result (Adobe reader also shows it correctly now):

S:\pdfs\xpdf-tools-win-4.02\bin32>pdffonts -loc "c:\temp\test3.pdf"
name                                           type              emb sub uni prob object ID location
---------------------------------------------- ----------------- --- --- --- ---- --------- --------
Tahoma                                         TrueType          no  no  no            6  0 external: C:\WINDOWS\Fonts\tahoma.ttf
OQRQQB+Code3de9                                TrueType          yes yes no            8  0 embedded
UXMNNE+Code128                                 TrueType          yes yes no           10  0 embedded
VXAITD+KIXBarcode                              TrueType          yes yes no           12  0 embedded
CAOSJQ+SegoeScript                             TrueType          yes yes no           14  0 embedded

I'm sure this bit of code can be much improved upon when officially integrated (or done in a completely different way).
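For the collision check that the hack above leaves out, one possible approach is to remember the tags already handed out for the document and retry when a duplicate comes up. This is only a sketch under stated assumptions: NewSubsetTag and the TStringList used for bookkeeping are hypothetical and do not exist in SynPdf, and the program is written for Free Pascal in Delphi mode.

```pascal
program UniqueSubsetTags;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Classes;

// Hypothetical helper, not part of SynPdf: returns a six-letter tag plus '+'
// that has not been handed out before through the same UsedTags list.
function NewSubsetTag(UsedTags: TStringList): string;
var
  i: Integer;
begin
  repeat
    Result := '';
    for i := 1 to 6 do
      Result := Result + Chr(Ord('A') + Random(26));
  until UsedTags.IndexOf(Result) < 0; // retry on the (unlikely) collision
  UsedTags.Add(Result);
  Result := Result + '+';
end;

var
  Used: TStringList;
begin
  Randomize;
  Used := TStringList.Create; // one list per generated PDF document
  try
    WriteLn(NewSubsetTag(Used), 'Code128');
    WriteLn(NewSubsetTag(Used), 'KIXBarcode');
  finally
    Used.Free;
  end;
end.
```

With 26^6 possible tags a collision is very unlikely for a handful of fonts, so the loop will almost always run once; the list only guarantees what the specification asks for.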

Last edited by rvk (2022-04-28 14:30:47)

ab · 2022-04-28 15:56:45

Perhaps https://synopse.info/fossil/info/8d158c3f61 is good enough.

Since we have a single subset per font, we can reuse the very same non-random prefix.

rvk · 2022-04-28 16:11:29

It does not seem to give a real error. I tried it with just 'ABCDEF+' for every font/subset and that worked too.

But the documentation states:

The tag consists of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file must have different tags.

Or do you think that by "different subsets" they mean multiple subsets of the same font? That wouldn't really make much sense.

I need to find some online tool which checks the validity of the PDF to make sure (but all the generators I've seen use really random characters for all subsets).

Edit: The original PDF from SynPDF (so all its PDFs) gives an error in this online validator https://www.datalogics.com/products/pdf … f-checker/ (Ghostscript ones are fine)

Edit #2: Ah. 1.3 files are fine.

Edit #3: This way you could also potentially have trouble if merging multiple SynPdf files with the same font but different subsets, I think

Last edited by rvk (2022-04-28 16:35:48)

ab · 2022-04-28 19:58:45

I guess it is about the merger to ensure the subsets are compatible.
When using our generator, each font is processed exactly once, so it is safe to use "SUBSET+" as prefix.

rvk · 2022-04-28 20:14:11

I meant merging with another pdf with another tool which also merges embedded subset fonts.

But I'm not sure exactly why this requirement for random prefix is there.

It says random and unique over all subsets. Not just subsets per font. (It also says clearly that the prefix tag should be unique, so not the combination of prefix with fontname).

But I'll use it like this for now.
If I find a clearer source stating that this would be against the specifications, I will let you know.

ab · 2022-04-29 12:46:14

Please check https://github.com/synopse/mORMot/commi … a661c973e9

rvk · 2022-04-29 22:04:24

I think there was still a small error in your previous change.
You had this:

// see 5.5.3 Font Subsets: begins with a tag followed by a +
TPdfName(fFontDescriptor.ValueByName('FontName')).AppendPrefix;
TPdfName(fFontDescriptor.ValueByName('BaseFont')).AppendPrefix; // <---- this line

But I think BaseFont isn't part of fFontDescriptor but of Data.
So it should be

TPdfName(Data.ValueByName('BaseFont')).AppendPrefix;

Otherwise BaseFont isn't found and isn't changed with prefix. And BOTH FontName AND BaseFont need to be prefixed.

(Adobe reader does show it as Subset when only FontName is prefixed but pdffonts.exe does not show it as subset.)

This was the result if BaseFont is not prefixed.

S:\pdfs\xpdf-tools-win-4.02\bin32>pdffonts -loc c:\temp\test2.pdf
name                                           type              emb sub uni prob object ID location
---------------------------------------------- ----------------- --- --- --- ---- --------- --------
Tahoma                                         TrueType          no  no  no            6  0 external: C:\WINDOWS\Fonts\tahoma.ttf
Code3de9                                       TrueType          yes no  no            8  0 embedded
Code128                                        TrueType          yes no  no           10  0 embedded
KIXBarcode                                     TrueType          yes no  no           12  0 embedded
SegoeScript                                    TrueType          yes no  no           14  0 embedded

When I change the line to

TPdfName(Data.ValueByName('BaseFont')).AppendPrefix;

I get

S:\pdfs\xpdf-tools-win-4.02\bin32>pdffonts -loc c:\temp\test2.pdf
name                                           type              emb sub uni prob object ID location
---------------------------------------------- ----------------- --- --- --- ---- --------- --------
Tahoma                                         TrueType          no  no  no            6  0 external: C:\WINDOWS\Fonts\tahoma.ttf
NFHHHJ+Code3de9                                TrueType          yes yes no            8  0 embedded
DABIJH+Code128                                 TrueType          yes yes no           10  0 embedded
ANHKBL+KIXBarcode                              TrueType          yes yes no           12  0 embedded
DCMDKN+SegoeScript                             TrueType          yes yes no           14  0 embedded

And that one seems to be correct.

BTW. Nice idea to use random32 and just snip 4 bits off each time for the random 6 letters
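As I understand the idea, each letter consumes 4 bits of a 32-bit random value, so every letter falls in 'A'..'P' (which matches the tags in the listing above). A minimal sketch of that technique follows; TagFromBits is a hypothetical name, and the bit order used by the actual mORMot code is an assumption here, so this is not a copy of the committed implementation.

```pascal
program TagFromRandom32;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: builds a six-letter tag from a 32-bit value by
// taking the lowest 4 bits per letter, so each letter is in 'A'..'P'.
function TagFromBits(Bits: Cardinal): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to 6 do
  begin
    Result := Result + Chr(Ord('A') + Integer(Bits and 15));
    Bits := Bits shr 4; // 6 * 4 = 24 of the 32 bits are consumed
  end;
  Result := Result + '+';
end;

begin
  // fixed input so the result is reproducible: nibbles 0,1,2,3,4,5
  WriteLn(TagFromBits($00543210), 'Code128'); // prints ABCDEF+Code128
end.
```

In real use the argument would come from a random generator rather than a constant; the trade-off against Random(26) per letter is one generator call per tag at the cost of only 16 possible letters per position.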

(With svn update your changes did get merged nicely locally with my already changed fEmbeddedSubsetCleanup changes. Never tried that before.)

Last edited by rvk (2022-04-29 22:05:03)

ab · 2022-04-30 20:05:21

You are right!

Please try https://github.com/synopse/mORMot2/commit/c857b693

rvk · 2022-04-30 21:29:38

ab wrote:

You are right!
Please try https://github.com/synopse/mORMot2/commit/c857b693

I'm using SynPDF. Not mORMot2. And SynPdf hasn't changed for me yet on github (still at revision 214).
I'll try again tomorrow and/or monday.

(The ReduceTTF is already in SynPdf trunk (r213) and seems to do a good job).

rvk · 2022-05-01 14:50:39

Besides the BaseFont still needing to be prefixed, I also noticed something else with CID embedding.

In Adobe reader those embedded CID fonts are marked as SubSet but with pdffonts they are not.

S:\pdfs\xpdf-tools-win-4.02\bin32>pdffonts -loc c:\temp\test2.pdf
name                                           type              emb sub uni prob object ID location
---------------------------------------------- ----------------- --- --- --- ---- --------- --------
Arial                                          TrueType          no  no  no            6  0 external: C:\WINDOWS\Fonts\arial.ttf
Arial,BoldItalic                               TrueType          no  no  no            8  0 external: C:\WINDOWS\Fonts\arialbi.ttf
GJGBGN+Code3de9                                TrueType          yes yes no           10  0 embedded
LJFFNH+Code128                                 TrueType          yes yes no           12  0 embedded
LOPGJG+KIXBarcode                              TrueType          yes yes no           14  0 embedded
EAMHGK+SegoeScript                             TrueType          yes yes no           16  0 embedded
SegoeScript                                    CID TrueType      yes no  yes          18  0 embedded          // <---- this one is subset = no ??

(That last one doesn't get marked as subset)

Maybe they also need to be prefixed.
I'm not sure why it all seems to work correctly even with those font names not being named correctly, but I thought I would mention it.

CID is a subset, isn't it? (Adobe says so)

rvk · 2022-05-16 08:41:29

rvk wrote:

ab wrote:

You are right!
Please try https://github.com/synopse/mORMot2/commit/c857b693

I'm using SynPDF. Not mORMot2. And SynPdf hasn't changed for me yet on github (still at revision 214).
I'll try again tomorrow and/or monday.

(The ReduceTTF is already in SynPdf trunk (r213) and seems to do a good job).

FYI (and reminder). Changes of mormot.ui.pdf.pas from Revision 3316 haven't made it to SynPDF trunk yet.

(It still uses fFontDescriptor.ValueByName('BaseFont') instead of Data.ValueByName('BaseFont'))

Also, the subset CID embedded fonts are still not prefixed (see post above).

Not a problem for me, but I thought I'd mention it before the changes grow further apart
(if this is something that usually takes more time, you can forget this reminder).

ab · 2022-05-16 11:55:17

About mORMot 1 backport of Data instead of fFontDescriptor bug:
https://synopse.info/fossil/info/a5e5d4c449

About subsetting of CID embedded fonts, I tried to fix it with the following:
https://synopse.info/fossil/info/1ba26a0223
but it was not successful. At least the 'BaseFont' name now matches for the Ansi and CID fonts, and for the descriptor.
I don't know what is required for CID fonts - I couldn't find anything in the official PDF reference manual.

rvk · 2022-05-16 12:12:03

ab wrote:

About mORMot 1 backport of Data instead of fFontDescriptor bug:
https://synopse.info/fossil/info/a5e5d4c449

Yes, that seems to do the trick now.

ab wrote:

About subsetting of CID embedded fonts, I tried to fix it with the following:
https://synopse.info/fossil/info/1ba26a0223
but it was not successful. At least the 'BaseFont' name now matches for the Ansi and CID fonts, and for the descriptor.
I don't know what is required for CID fonts - I couldn't find anything in the official PDF reference manual.

It's not a critical thing I guess. Adobe reader does mark both the TT and CID as Subset.
But Adobe doesn't really use the 6 random characters to differentiate between Subset and Full (it uses something else).

pdffonts gives this:

S:\pdfs\xpdf-tools-win-4.02\bin32>pdffonts c:\temp\test3.pdf
name                                           type              emb sub uni prob object ID
---------------------------------------------- ----------------- --- --- --- ---- ---------
Arial                                          TrueType          no  no  no            6  0
Arial,BoldItalic                               TrueType          no  no  no            8  0
ELKHBC+Code3de9                                TrueType          yes yes no           10  0
PPIMAN+Code128                                 TrueType          yes yes no           12  0
JIGIKN+KIXBarcode                              TrueType          yes yes no           14  0
OMPCCH+SegoeScript                             TrueType          yes yes no           16  0
SegoeScript                                    CID TrueType      yes no  yes          18  0

So the subset TT fonts are now indeed seen as sub.
The CID font is not marked as sub, although in Adobe it is.

So, it probably isn't a critical thing (it all seems to work correctly).
If I find more exact official documentation about CID I'll let you know.

Thanks.

mORMot Open Source
