Wednesday, May 20, 2020

C# Get the number of pages in a PDF document from metadata without loading the entire file


A popular solution is to shell out to pdfinfo on the command line to get the number of pages in a PDF. However, if you examine the source code (https://dl.xpdfreader.com/xpdf-4.02.tar.gz), you'll see that it reads the entire PDF to determine the page count. The run time therefore depends on the file size and is especially slow for large files.


Here's a way to read the page count of a PDF using the metadata in a fixed amount of time.
BUT BE ADVISED: this only works for some PDF versions and encodings, and it is not guaranteed to return a result. For that, you need to read the entire file and iterate over the matches of this regex: "/\/Count\s+(\d+)/".
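
The full-file fallback mentioned above could look roughly like the following. This is only a sketch under assumptions: the class and method names are made up for illustration, and it assumes the largest /Count value belongs to the root of the page tree. That can be wrong, since outline (bookmark) dictionaries also carry a /Count entry, and nothing is found when the page tree is stored in a compressed object stream.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfCountFallback // hypothetical name, not part of the original code
{
    private static readonly Regex rgxCount =
        new Regex(@"/Count\s+(\d+)", RegexOptions.CultureInvariant);

    // Returns the largest /Count value found in the text, or 0 if there is none.
    public static int MaxCountInText(string pdfText)
    {
        int max = 0;
        foreach (Match m in rgxCount.Matches(pdfText))
        {
            int n;
            if (int.TryParse(m.Groups[1].Value, out n) && n > max)
                max = n;
        }
        return max;
    }

    // Reads the whole file, so the run time grows with the file size.
    public static int GetPageCountFullScan(string fileName)
    {
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return MaxCountInText(latin1.GetString(File.ReadAllBytes(fileName)));
    }
}
```

Call PdfCountFallback.GetPageCountFullScan(path) only when the fast method returns an empty string, and treat a result of 0 as "unknown".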

However, this runs in a roughly fixed amount of time (about 10 ms), since it reads only the first 32,767 bytes of the file. It then tries a few known patterns for the page count, most common first.
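
To make the idea concrete, here is a small sketch of what the most common pattern is looking for. The sample header is made up for illustration (real files differ in offsets and key order), the class name is hypothetical, and the regex is a slightly tightened form of the first pattern in the code further down. In a linearized PDF, the dictionary near the start of the file stores the page count under the /N key.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinearizedHeaderDemo // hypothetical name, for illustration only
{
    // Made-up start of a linearized PDF; /N 5 is the page count.
    public const string SampleHead =
        "%PDF-1.4\n1 0 obj\n<</Linearized 1/L 7945/O 9/E 3524/N 5/T 7656/H [ 451 137]>>\nendobj\n";

    private static readonly Regex rgx =
        new Regex(@"/N\s*(\d+)\s*/[TO]\s", RegexOptions.CultureInvariant);

    // Returns the digits after /N, or an empty string when the pattern is absent.
    public static string PageCountFromHead(string head)
    {
        Match m = rgx.Match(head);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Files that are not linearized have no such dictionary, which is why the other patterns are tried afterwards.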

Please add more patterns in the comments section. You can find them by inspecting PDF files.

You can inspect PDF files using Frhed, a free hex/binary editor for Windows, which will open a PDF file so you can view the metadata for yourself. Or test drive my Clipboard Plaintext Power Tool, which has a custom version of Frhed built in.

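// Requires: using System; using System.IO; using System.Text; using System.Text.RegularExpressions;

// "/N <pages>" followed by /T or /O, as in the linearization dictionary of a linearized PDF. Seems to be the most common and reliable.
const string strRegexNT = @"\/N\s*(\d+)\s*\/[TO]\s";
private static readonly Regex rgxNT = new Regex(strRegexNT, RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);

// Caution: in "/Pages 2 0 R" the first number is an object reference, not a page count, so this pattern can return a wrong value.
const string strRegexPRT = @"\/Pages\s*(\d+).*R.*\/T";
private static readonly Regex rgxPRT = new Regex(strRegexPRT, RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);

// Page tree node written without whitespace: <</Type/Pages/Count N/Kids
const string strRegexTPC = @"\<\<\/Type\/Pages\/Count\s*(\d+)\s*\/Kids";
private static readonly Regex rgxTPC = new Regex(strRegexTPC, RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);

// Page tree node stored as object "7 0" directly after another object; the object number is hardcoded.
const string strRegexEndObj = @"endobj\s*7\s*0\s*obj\s*\<\<\s*\/Count\s*(\d+)\s*\/";
private static readonly Regex rgxEndObj = new Regex(strRegexEndObj, RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);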

/// <summary>
/// Gets the number of PDF pages by reading only the first 32,767 bytes (about 10 ms); should cover most cases.
/// https://metadataconsulting.blogspot.com/2020/05/C-Get-the-number-of-pages-in-a-PDF-document-from-metadata-without-load-entire-file.html
/// </summary>
/// <param name="fileName">Path to the PDF file.</param>
/// <returns>The page count as a string, or an empty string if the file could not be read or no pattern matched.</returns>
public static string GetNumofPdfPages(string fileName)
{
    string head = string.Empty;

    try // File.OpenRead can throw (missing file, access denied, ...)
    {
        using (BinaryReader br = new BinaryReader(File.OpenRead(fileName)))
        {
            // ReadBytes returns fewer bytes when the file is shorter than 32,767 bytes.
            // Decoding binary data as UTF-8 is lossy, but the ASCII keywords matched below survive.
            head = Encoding.UTF8.GetString(br.ReadBytes(Int16.MaxValue)); //32767
        }
    }
    catch { } // any failure falls through and returns an empty string

    if (string.IsNullOrEmpty(head))
        return string.Empty;

    // Try each pattern in order, most common first, and return the first non-empty capture.
    // Match.Groups.Count is the same whether or not the match succeeded, so test Success instead.
    foreach (Regex rgx in new[] { rgxNT, rgxPRT, rgxTPC, rgxEndObj })
    {
        Match m = rgx.Match(head);
        if (m.Success && !string.IsNullOrEmpty(m.Groups[1].Value))
            return m.Groups[1].Value;
    }

    return string.Empty;
}
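
Since GetNumofPdfPages returns a string that may be empty, callers will usually want to convert it to a number. A minimal sketch of that step follows; the helper name is hypothetical and not part of the original code.

```csharp
using System;
using System.Globalization;

public static class PageCountParsing // hypothetical helper, for illustration only
{
    // Converts the string returned by GetNumofPdfPages into a positive int.
    // Returns false for empty input, non-digit input, overflow, or zero.
    public static bool TryParsePageCount(string raw, out int pages)
    {
        pages = 0;
        return !string.IsNullOrEmpty(raw)
            && int.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out pages)
            && pages > 0;
    }
}
```

Typical use would be to pass in the result of GetNumofPdfPages(path) and fall back to a slower full-file method when the helper returns false.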
