Recently an issue came up with the export-to-PDF functionality in our printing and reporting suites, XtraPrinting and XtraReports. It concerned characters outside the normal 7-bit ASCII range, and whether the fonts should be embedded in that case.
I decided to take a peek at how we've implemented the export-to-PDF functionality so that you can gain a better understanding of the issues. Since the whole subject is pretty complicated, I took the hit so that you don't have to.
With PDF, there is a fundamental distinction between the notion of a character and a glyph. A character is a symbol, like "A" or "4", whereas a glyph is an image or rendering of that character. A collection of glyphs is known as a font.
Easy enough, except that a character has to be encoded in some way into a binary number. We're used to thinking of the character "A", for example, as being encoded as 0x41. This particular encoding started life in ASCII and has now propagated into Unicode.
That's all very well, but in the days when a character was encoded in a single byte, there weren't enough byte values available to encode all possible characters. So the notion of codepages evolved to encode different characters in the range 128 to 255. In the Latin codepage or character set (loosely known as Latin-1 or Windows-1252; the two differ only in the range 0x80 to 0x9F), for example, the character à is encoded as 0xE0 (and, again, that propagated to Unicode). However, on the Mac the character à is encoded as 0x88.
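The codepage difference is easy to see for yourself with Python's built-in codecs. This is just a minimal sketch: the codec names `latin-1`, `cp1252`, and `mac_roman` are Python's labels for these character sets, and the helper `byte_value` is a name I made up for the illustration.

```python
def byte_value(char: str, codec: str) -> int:
    """Return the single byte value that `codec` assigns to `char`."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{codec} does not encode {char!r} as a single byte")
    return encoded[0]

# The same character, three single-byte character sets.
for codec in ("latin-1", "cp1252", "mac_roman"):
    print(f"{codec:10} 0x{byte_value('à', codec):02X}")
```

The two Latin codecs agree on the byte value, whereas the Mac codec assigns a different one, which is exactly why a PDF has to state which encoding its text uses.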
Back to PDF. PDF is a file format that is essentially text and not binary. (Yes, I'm oversimplifying since text blocks can be compressed using the Deflate algorithm and will appear as binary blobs, but bear with me.) The text is obviously represented using some encoding. There is a set of standard encodings for text in PDFs: one for Macs called MacRomanEncoding, one for "Windows ANSI" (which is essentially codepage 1252) called WinAnsiEncoding, and one for a more general PDF codepage called PDFDocEncoding. All of these encodings are single-byte encodings: each byte value represents a different character.
Back to fonts in PDFs (as you can see, there are lots of strands to pull together here to get the full tapestry). There are two different ways to define the fonts in a PDF. The first, and very lightweight way, is to describe the font as a set of metrics (name, width of glyphs, slope of italics, and so on). The reader of the PDF (say, Acrobat Reader) is then responsible for locating the font on the user's machine and using it. If the actual font is not available on the user's machine, the reader then has to locate the nearest font that matches the font metrics embedded in the PDF.
If the PDF uses fonts in this way, the text in the PDF is encoded using one of the standard encodings. As you can imagine, you are relying on the user's machine being pretty similar to the machine that generated the PDF, otherwise the user is possibly going to get some weird effects (wrong or missing glyphs, a different look to the page, and so on).
In XtraPrinting and XtraReports, we used to use PDFDocEncoding in this situation. However, the majority of our customers use the Latin-1 codepage, and so there was a possibility that reports could have some missing or invalid glyphs when exported as a PDF for these customers. For the next minor version (2008.2.3), we've switched to WinAnsiEncoding and this change should help more people.
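To get a feel for which text can survive this non-embedded route, here is a rough sketch that uses Python's `cp1252` codec as a stand-in for WinAnsiEncoding. That is an approximation (the two are essentially the same, but I'm not claiming they match byte for byte), and the function name `unencodable_chars` is purely illustrative.

```python
from typing import List

def unencodable_chars(text: str, codec: str = "cp1252") -> List[str]:
    """Return the distinct characters in `text` that `codec` cannot
    represent, in the order they are first seen."""
    missing: List[str] = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

print(unencodable_chars("Voilà, 100 €"))  # everything fits in one byte each
print(unencodable_chars("Привет, à"))     # the Cyrillic letters do not fit
```

Any character this reports is one that a single-byte standard encoding cannot carry, so text containing it is a candidate for the embedded approach instead.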
Of course, you may be thinking that it's nice to generate very small PDFs in this way, but it's all a little bit too hit and miss on the reader side. That's why there is a second way of defining fonts in PDFs: to embed them.
Here the onus is on the writer of the PDF. It has to analyze the text in the PDF, work out which glyphs of the font are being used and then embed those glyphs directly in the PDF. If fonts are embedded in the PDF in this way, the text is actually encoded in a two-byte manner and a map is generated that maps the character encoding to the index of the glyph in the embedded font.
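The bookkeeping involved can be sketched roughly as follows. This is a toy model and not how XtraPrinting does it: a real writer works with the font's own glyph tables, whereas here a "glyph index" is simply the order in which a character first appears. I'm also assuming, for illustration only, that the two-byte code for a character is its UTF-16 big-endian value. The names `build_glyph_map` and `encode_two_byte` are hypothetical.

```python
from typing import Dict

def build_glyph_map(text: str) -> Dict[int, int]:
    """Map each distinct two-byte character code used in `text` to an
    index in the (toy) embedded subset. Index 0 is left free for the
    'missing glyph' slot, so real glyphs start at 1."""
    glyph_map: Dict[int, int] = {}
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError(f"{ch!r} does not fit in a two-byte code")
        if code not in glyph_map:
            glyph_map[code] = len(glyph_map) + 1
    return glyph_map

def encode_two_byte(text: str) -> bytes:
    """Encode every character as exactly two bytes (assumed: UTF-16BE)."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("text contains a character outside the two-byte range")
    return text.encode("utf-16-be")

text = "banana"
print(build_glyph_map(text))       # three distinct characters, three glyphs
print(encode_two_byte(text).hex()) # two bytes per character
```

Note that only three glyphs need embedding for the six characters of "banana"; that is the saving the writer gets from analyzing the text first.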
Using fonts in this way, you get absolutely precise control. What You See (as the writer) Is What You Get (as the reader). There is no wishy-washy, crossed-fingers, hope-for-the-best aspect to reading the PDF: the end-user will see exactly what you wanted them to see, with no dropped or swapped glyphs. The downside is, obviously, that the PDF is "heavier" or larger since it has to have all those glyphs embedded.
(A note for those who are really clued up on PDFs, character encodings, and font support: there is yet another method. The writer can define a ToUnicode character map, or CMap, that is a single-byte encoding that uses a special map to work out which glyph to use with a font that's not embedded. We don't support this variant yet.)
Having said all that, you can mix and match. You can have some text in your PDF where you assume that the reader will have the required fonts available to display it, and you can have some text where you embed the fonts. In our printing suites, you can manage this scenario by using the NeverEmbeddedFonts property. If you name a font in the NeverEmbeddedFonts property (it's a semicolon-separated list of font names) it will be defined in the PDF in the first, lightweight, way. If a font is used in the PDF that is not in this list, then it will be embedded in the PDF in the second, heavier, way.
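The decision rule itself can be modelled in a few lines. The sketch below is in Python purely for illustration (the actual property lives in our .NET suites), and the helper names, the trimming of whitespace, and the case-insensitive comparison are all my assumptions rather than documented behavior.

```python
from typing import Set

def parse_never_embedded(value: str) -> Set[str]:
    """Split a semicolon-separated font list into a set of normalized names.
    Normalization (trim + casefold) is an assumption made for this sketch."""
    return {name.strip().casefold() for name in value.split(";") if name.strip()}

def should_embed(font_name: str, never_embedded: Set[str]) -> bool:
    """A font is embedded unless it is named in the never-embed list."""
    return font_name.strip().casefold() not in never_embedded

never = parse_never_embedded("Arial;Times New Roman; Tahoma")
for font in ("Arial", "Segoe UI"):
    how = "embed glyphs" if should_embed(font, never) else "describe by metrics only"
    print(f"{font}: {how}")
```

In other words, the list is an opt-out: anything you don't name gets the heavier, safer treatment.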
And that's pretty much it. I hope that this exposition has made the whole issue of fonts in PDFs clearer and that it helps you make the best decisions in your particular scenario when you need to export your report to PDF for your users.