Reference checking AI prompt

The prompt below is the product of much work and many iterations to get an AI to read and process a PDF, extract all of the signposts (index tables, page references, etc.) and check that they point to the correct page.

ChatGPT and Claude cannot correctly interpret the type of PDF documents I have been working with (generated by Adobe InDesign), but myaidrive.com can read them. I don’t know why; just take it from me that some AI tools can do it and others cannot.

The prompt below is about eight pages long and is free for everyone to use. It is honed for an Annual Report, but can easily be adapted (by AI) to work for a prospectus or regulatory document. I hope you find it useful.


You will receive a PDF of a company’s Report & Accounts. Your job is to read the entire PDF carefully and verify that every page number referenced anywhere in the document is accurate.

Scope & Definitions

  • Contents / Table of Contents (TOC): Any page(s) listing sections with page numbers (including “Contents”, “Table of contents”, front index pages that look like contents grids or panels, Financial statements contents, Notes contents, “List of tables/figures”).
  • Index: Alphabetical subject index near the back, and any “Index of terms/definitions”.
  • Signpost: Any in-text cross-reference such as “see page X”, “refer to page X”, “(page X)”, “see Note Y on page X”, or variants (“p.”, “pp.”, ranges “23–25”, lists “23, 27, 31”), “Read more … on pages …”, “… found on page …”, and panel CTA patterns where a short subject line is followed by “page N”.
  • Right page (verification target): The first page whose primary (highest-level) heading, or table/figure label, or a clearly titled graphic (in that priority order) matches the subject of the reference.
  • No follow-ups: If anything is uncertain, make your best determination and record the uncertainty in Reason/Evidence.
  • Always report both page references:
    1. PDF page index (0-based), and
    2. Document page label (the printed label on the page: roman i, ii, iii…; arabic 1, 2, 3…; appendix labels like A-1).
      Do not rely only on bookmarks/outlines—use the page content.

Phase 0 — Page Labels, OCR & Text Normalisation

Goal: Build accurate page-label maps and a unified text layer before any parsing.

0A — Page-label detection (Header-first, then Footer)

Many Reports & Accounts (R&As) print page numbers at the top. For every page:

  1. Header OCR (top 25%)
    • OCR the top band; extract candidate 1–3 digit tokens.
    • Ignore 4-digit year-like tokens and “2024/25”-style fractions.
    • Prefer the last viable 1–3 digit token in the header as the page label.
  2. Footer OCR (bottom 25%) (fallback)
    • Same rules. Prefer patterns like “… Annual Report 2024/25 <page>”; capture the final 1–3 digit token as the label.
    • If two numeric tokens appear, treat 1–3 digit tokens as candidates; ignore 4-digit tokens unless appendix style (e.g., A-12) is present.
  3. Label confidence gating
    • If the label is low confidence/ambiguous, set Doc Label = “uncertain”. Keep the page; never drop it from downstream steps. Explain your inference later in Reason/Evidence.
  4. Optional anchor mapping (when provided or obvious)
    • If the user supplies an anchor (e.g., doc page 1 = PDF page 3), apply this as a hard offset for arabic pages (label = index – offset). Keep OCR labels too; note the rule in the Summary and Evidence.
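As a minimal sketch of rules 1–2 (and the anchor mapping in rule 4), assuming the header/footer band has already been OCR’d into a string and PDF indices are 0-based; the function names are illustrative, not part of the prompt:

```python
import re

def pick_header_label(band_text):
    """Pick a page label from an OCR'd header or footer band.

    Drops '2024/25'-style fractions first; the word boundaries in the
    token pattern then exclude 4-digit year tokens automatically.
    Prefers the LAST viable 1-3 digit token, per rule 1.
    """
    cleaned = re.sub(r"\b\d{4}/\d{2}\b", " ", band_text)
    candidates = re.findall(r"\b\d{1,3}\b", cleaned)
    return candidates[-1] if candidates else None

def label_from_anchor(pdf_index, anchor_pdf_index, anchor_label=1):
    """Hard-offset mapping for arabic pages, e.g. doc page 1 at PDF index 2."""
    offset = anchor_pdf_index - anchor_label
    return pdf_index - offset
```

Note that `pick_header_label("Annual Report 2024/25 37")` yields `"37"`, while a bare `"2024"` yields no label at all, matching the gating in rule 3.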

Build & retain both maps (do not collapse duplicates):

  • PDF index → Doc label (every page), and
  • Doc label → first PDF index (for navigation).
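A sketch of both maps, assuming one detected label per PDF index (illustrative names). Keeping every entry in the first map, and only the first index per label in the second, is what "do not collapse duplicates" amounts to:

```python
def build_label_maps(labels):
    """labels[i] is the detected doc label for PDF index i (0-based).

    Returns (PDF index -> label) for every page, duplicates kept,
    and (label -> FIRST PDF index) for navigation.
    """
    index_to_label = dict(enumerate(labels))
    label_to_first_index = {}
    for i, label in enumerate(labels):
        # First occurrence wins; later duplicates stay in index_to_label only.
        label_to_first_index.setdefault(label, i)
    return index_to_label, label_to_first_index
```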

0B — Image-only detection & unified text

  • Flag pages with near-zero extractable text (vector text, graphics, scanned).
  • For those pages (and for front index/contents-like pages), build a unified text layer by:
    • Rasterising the page at multiple DPIs (300/450/600),
    • Applying OCR to each variant, and
    • Merging OCR with any native text (preserve reading order, deskew if needed, handle multi-column layouts, preserve words; de-hyphenate as you join wraps).
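The wrap-joining and de-hyphenation step might look like this simple sketch. It deliberately ignores legitimately hyphenated words (e.g. "year-end" split across lines), which a production version would check against a lexicon:

```python
def join_wraps(lines):
    """Join OCR/native text lines into flowing text, de-hyphenating wraps."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # A word was split across the wrap: drop the hyphen, no space.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text
```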

0C — Enhanced recovery for graphics/“vector text” contents grids (e.g., doc page 1)

When a TOC/front-index page is mostly vector artwork:

  1. Rasterisation: Render at 500–600 dpi (RGB).
  2. Pre-processing (per full page and per column crop):
    • Convert to grayscale.
    • Generate variants:
      • Contrast boost (≈1.6–2.0) and sharpness (~1.4–1.8),
      • Global thresholds (e.g., 150, 185) → binary images,
      • Light Gaussian blur (0–2 px) followed by thresholding for line cleanup.
    • Crop likely columns: left/right halves, and (if needed) thirds; optionally top/bottom halves if the content sits high/low.
  3. Line-level OCR: Use an engine that supports line grouping (e.g., Tesseract image_to_data) with PSM 4 and 6. Group by block/paragraph/line to reconstruct lines in reading order. Deduplicate by text-normalized lowercase.
  4. Parse recovered lines into TOC entries (dedup first):
    • Dot-leader / spaced TOC lines:
      ^(subject)…(label)$ → trailing 1–3 digit label
    • Same-line page form:
      subject\s+(page|pg\.|pp\.)\s+(\d{1,3})$
    • Dashed page form:
      subject\s*[–—-]\s*(page|pg\.|pp\.)\s+(\d{1,3})$
    • Panel CTA (two-line): a line that looks like a subject (≤160 chars, not purely numeric) immediately followed by a line page N
    • Tail-number tolerance: subject … \d{1,3} only when the previous or next line contains page N
    • Clean: remove dot leaders, collapse whitespace, de-hyphenate wrap splits.
  5. Emit each pair as a TOC row with Match Basis = “TOC OCR enhanced” and a short note (e.g., “panel:next-line”, “regex:same-line”).
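The single-line patterns in step 4 could be approximated as follows (regexes simplified from the forms above; names illustrative). Trying the dashed form before the plain same-line form matters, so that "Investors — page 79" does not leave the dash glued to the subject:

```python
import re

DOT_LEADER = re.compile(r"^(?P<subject>.+?)[\s.…]{2,}(?P<label>\d{1,3})$")
SAME_LINE = re.compile(
    r"^(?P<subject>.+?)\s+(?:page|pg\.|pp\.)\s+(?P<label>\d{1,3})$", re.IGNORECASE)
DASHED = re.compile(
    r"^(?P<subject>.+?)\s*[–—-]\s*(?:page|pg\.|pp\.)\s+(?P<label>\d{1,3})$",
    re.IGNORECASE)

def parse_toc_line(line):
    """Return (subject, label) for one recovered TOC line, or None."""
    line = line.strip()
    for pattern in (DASHED, SAME_LINE, DOT_LEADER):
        m = pattern.match(line)
        if m:
            subject = m.group("subject").rstrip(" .…").strip()
            return subject, m.group("label")
    return None
```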

Never drop a reference because label/OCR is uncertain. Extraction is decoupled from verification.


Phase 1 — TOC/Index Detection & Extraction (tight parser for this layout)

Goal: pull every single TOC/index entry, including wrapped lines, sub-TOCs for Financial Statements & Notes, and front-index items.

TOC detection (tightened)

  • Identify TOC blocks via any of:
    • A TOC title (“Contents”, “Table of contents”, “Financial statements — contents”, “Notes to the financial statements — contents”), or
    • Front index pages that behave like TOC (≥5 lines ending with a page label and/or showing dot leaders / “… page N” patterns), even if not titled “Contents”, or
    • High density of lines ending in a number or roman numeral.
  • Wrapped lines handling:
    A line with no page label that shares indentation/font with the prior entry is a continuation; join it to the prior (remove hyphenation; otherwise single space).
  • Hierarchy detection:
    Infer level by indentation, numeric prefixes (1., 1.1, A., Note 7), typographic emphasis (bold/small caps), and section headers (e.g., Financial statements, Notes to the financial statements).
    Store a Hierarchy Path (e.g., Financial statements › Primary statements › Consolidated income statement).
  • Sub-TOCs (Financial Statements & Notes):
    Treat dedicated contents page(s) for Financial Statements and for Notes as separate TOCs. Extract every entry, including every individual Note line (e.g., “Note 7: Intangible assets … 132”).

Index extraction (A–Z)

  • Capture every entry and sub-entry, with the exact entry text, and every page number/range listed.
  • Expand ranges (e.g., 23–25) and lists (23, 27, 31) into one row per page.
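Range/list expansion (needed here and again for signposts in Phase 2) is mechanical; a sketch handling arabic numbers only — roman-numeral ranges would need extra conversion:

```python
import re

def expand_pages(ref):
    """Expand '23–25' / '53 to 54' ranges and '23, 27 and 31' lists into ints."""
    pages = []
    for part in re.split(r"\s*(?:,|and)\s*", ref.strip()):
        m = re.match(r"^(\d+)\s*(?:–|-|to)\s*(\d+)$", part)
        if m:
            start, end = int(m.group(1)), int(m.group(2))
            pages.extend(range(start, end + 1))  # one row per page in the range
        elif part.isdigit():
            pages.append(int(part))
    return pages
```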

Phase 2 — Signpost Discovery (grammar-tuned; newline-tolerant; full quote preserved)

Scan the entire unified text (native + OCR). Detect all of the following (case-insensitive; allow punctuation/parentheses; tolerant of line breaks):

Canonical patterns

  • \b(see|refer to|turn to|consult|as set out on|as described on)\s+(?:page|pages|p\.|pp\.)\s+((?:[ivxlcdm]+|\d+)(?:\s*(?:–|-|to)\s*(?:[ivxlcdm]+|\d+))?(?:\s*(?:,|and)\s*(?:[ivxlcdm]+|\d+))*)\b
  • \bsee\s+note\s+(\d+[A-Za-z]?)\s+(?:on\s+)?(?:page|p\.)\s+([ivxlcdm]+|\d+)\b
  • Parenthetical (with trailing words allowed):
    \(\s*see\s+(?:note\s+\d+[A-Za-z]?\s+on\s+)?(?:page|pages|p\.|pp\.)\s+[ivxlcdm\d]+(?:\s*(?:–|-|to)\s*[ivxlcdm\d]+)?(?:\s*(?:,|and)\s*[ivxlcdm\d]+)*[^)]*\)

Additional/modern phrasings

  • Please refer (optional “please”):
    (?is)\b(?:please\s+)?(?:see|refer\s+to|turn\s+to|consult)\s+(?:page|pages|p\.|pp\.)\s+…
  • Read-more CTAs (newline/column tolerant):
    (?is)read\s+more(?:\s+in|\s+about|\s+on)?\s+[A-Za-z0-9'’&\-\s,()/]{0,160}?\s*(?:on|at)?\s*(?:page|pages|pp?\.?|pg\.?|pgs\.?)\s+[0-9ivxlcdm\s,\-–to]+
  • Found-on phrasing:
    (?is)(?:can\s+be\s+)?found\s+on\s+(?:page|pages|pp?\.?|pg\.?|pgs\.?)\s+[0-9ivxlcdm\s,\-–to]+
  • Panel CTA (two-line):
    A short, non-numeric subject line (≤80–160 chars), immediately followed by page N.
  • Index-style same-line:
    ^(?P<subject>.{2,160}?)\s+(?:page|pg\.?|pp\.)\s+(?P<num>\d{1,3})$
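The first canonical pattern from Phase 2 compiles as-is in Python; a quick harness to sanity-check it (the helper name is illustrative):

```python
import re

SIGNPOST = re.compile(
    r"\b(see|refer to|turn to|consult|as set out on|as described on)\s+"
    r"(?:page|pages|p\.|pp\.)\s+"
    r"((?:[ivxlcdm]+|\d+)"
    r"(?:\s*(?:–|-|to)\s*(?:[ivxlcdm]+|\d+))?"
    r"(?:\s*(?:,|and)\s*(?:[ivxlcdm]+|\d+))*)",
    re.IGNORECASE,
)

def find_signposts(text):
    """Return (verb, pages) tuples for every canonical signpost in text."""
    return [(m.group(1), m.group(2)) for m in SIGNPOST.finditer(text)]
```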

Subject extraction (and what to store)

  • Signposts: store the full quoted signpost text exactly as it appears (normalize whitespace; keep punctuation).
  • If the signpost appears inside a sentence (“See page 45 for the remuneration policy”), the subject for verification is the object/NP (“remuneration policy”).
  • If “see Note X on page Y”, include both the note id and verify the note title (from Notes TOC or the note page).
  • TOC/Index: store the visible entry text with dot leaders removed and wraps fixed.

Ranges/lists

  • Expand ranges (e.g., pp. 53–54) into 53 and 54.
  • Expand lists (pages 10, 12, 15) into one row per page.

Phase 3 — Target Resolution & Verification (strict, label-first)

Always check the stated page first before scanning around.

  1. Navigate by document label (Phase 0 map). If resolved, verify on that exact page in priority order:
    • Top-level heading exact/near-exact match to the subject, else
    • Table/Figure label, else
    • Graphic title/caption.
  2. Title-page adjacency heuristic:
    If the cited page is a title/facing/blank page and the body starts on the next page, set Correct? = False and set Target Page Found to where the content actually begins; explain in Reason/Evidence.
  3. If the label cannot be resolved:
    • Try ±2 adjacency only (n–2, n–1, n+1, n+2). If used, set Match Basis: “… + adjacency” and explain the anchor in Reason/Evidence.
    • Do not make long jumps (e.g., never drift 18 → 183).
  4. Content search fallback (last resort):
    If neither label nor adjacency resolves the target, perform a direct content search for the subject and pick the first occurrence. Mark Match Basis: similarity/content search and explain.
  5. Ranges: Confirm the start and whether the content spans the range; note this in Reason/Evidence.
  6. Front matter / main matter / appendices: Reconcile roman numerals, arabic numbers, and appendix labels (A-1, etc.). Where labels are missing/uncertain, state the inference.
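Steps 1 and 3 above amount to a label-first lookup with a hard cap on drift; a sketch, assuming the Phase 0 label map and a subject-matching predicate are available (both names illustrative):

```python
def resolve_target(stated_label, label_to_index, page_has_subject):
    """Resolve a stated page label, allowing only ±2 adjacency (no long jumps).

    label_to_index: dict mapping doc label -> first PDF index (Phase 0 map).
    page_has_subject: callable(pdf_index) -> True if the subject's heading,
        table/figure label, or graphic title appears on that page.
    """
    index = label_to_index.get(stated_label)
    if index is not None and page_has_subject(index):
        return index, "Label navigation"
    if index is not None:
        # Stated page resolved but subject absent: try n-2, n-1, n+1, n+2 only.
        for delta in (-2, -1, 1, 2):
            if page_has_subject(index + delta):
                return index + delta, "Label navigation + adjacency"
    return None, "similarity/content search"  # last-resort fallback (step 4)
```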

Phase 4 — Coverage Back-check & Re-verification Loop

  • Coverage back-check: After extraction, run a raw regex scan over the entire text for page refs (classic, parenthetical, read-more, found-on, panel, index-style).
    Any hits not already captured are appended with Reason/Evidence = “Added via coverage back-check.”
  • Re-verification loop: Re-run verification using the final cleaned subjects and the refreshed label map.
    If any item’s status or target changes (Unknown → True/False; corrected page), record counts in the Summary.
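The back-check merge is a dedupe-and-append; a sketch with illustrative row dictionaries keyed on source page and quoted text:

```python
def coverage_backcheck(captured, raw_hits):
    """Append raw-scan hits not already captured, tagging their provenance."""
    seen = {(row["source_index"], row["text"]) for row in captured}
    for hit in raw_hits:
        key = (hit["source_index"], hit["text"])
        if key not in seen:
            # New reference found only by the raw scan: record how it arrived.
            captured.append(dict(hit, evidence="Added via coverage back-check."))
            seen.add(key)
    return captured
```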

Output Format (exhaustive; include every reference)

Produce comprehensive Markdown table(s) and a CSV block. Use exactly these columns:

  1. Reference Type — TOC, Index, Signpost
  2. Source Page (PDF Index) — integer (0-based) where the reference appears
  3. Source Page (Doc Label) — printed page label (e.g., “ii”, “1”, “A-2”); “uncertain” if not confidently readable (explain later)
  4. Reference Text / Subject — TOC/Index entry text or the full quoted signpost text (normalized after wrap-joining)
  5. Target Page Stated (Doc Label) — the page label as written (single page, range, or list)
  6. Target Page Found (PDF Index) — 0-based index where the correct content actually begins
  7. Target Page Found (Doc Label) — printed label where the content actually begins
  8. Match Basis — Top-level heading, Table/Figure label, Graphic title/caption, Label navigation, … + adjacency, similarity/content search, or TOC OCR enhanced
  9. Correct? — True or False
  10. Reason / Evidence — short justification (e.g., “Heading ‘Directors’ report’ on stated page”, “Stated 110; heading begins 111”, “Label 18 resolved; ‘Assurance’ heading present”, “Added via coverage back-check”, “Recovered via enhanced OCR (panel next-line)”)

Formatting requirements

  • List every single entry from all TOCs (incl. Financial Statements & Notes sub-TOCs) and every Index entry/sub-entry.
  • For ranges, create one row per page cited; note coverage in Evidence.
  • Preserve exact entry text (minus dot leaders), fix wraps, and remove hyphenation where a word was split across lines.
  • If a page shows two labels (e.g., unnumbered title page followed by “1”), explain your decision in Reason/Evidence.
  • If a TOC/Index says “see also …” without a page number, omit (not a page reference).

Deliverables

  1. Markdown table(s) with all rows and the schema above.
  2. CSV (one code block) with the same rows/columns, ready to copy/download.
  3. Summary:
    • Total references checked; True vs False counts.
    • Notable systemic issues (e.g., “Notes off by +1”, “appendix labels mismapped”, “header-only page labels OCR-only”, “front index recovered via enhanced OCR”).
    • Re-verification diff: count of rows whose status/target changed after OCR/tight parsing.
    • Count of uncertain page labels and how many references were verified via adjacency or similarity/content search rather than by label.

(Optional for auditing: include a separate, supplementary audit table with “Label Confidence” and “Found Via (doc-label | adjacency | similarity-search | coverage-backcheck | TOC-OCR)”; do not add extra columns to the main CSV.)


Extra robustness & edge cases

  • Minor punctuation/case differences count as matches.
  • Prefer the first occurrence of the correct top-level heading.
  • If a page shows a title only and the body begins on the next page, validate against where the section actually begins (mark the cited title page False, per spec).
  • Multi-column TOCs/Indexes: scan both columns; don’t skip the final short column.
  • Figures/tables with embedded page references (“see page …”) count as Signposts.
  • If any step is uncertain, make your best determination and record the uncertainty in Reason/Evidence (no follow-ups).

Acceptance tests (must all pass)

  1. Header labels: Page numbers printed at the top are correctly mapped via header OCR; year/fraction tokens (e.g., 2024/25) are ignored as labels.
  2. Graphic contents grid (doc page 1): All visible entries are recovered via enhanced OCR (multi-DPI + column crops + panel pairing) and emitted as TOC rows; verification uses label navigation.
  3. Front-index patterns: Lines like “Investors — page 79” (or on two lines) are captured; same for “Read more … page 25”.
  4. Parenthetical & “please refer”: “(see page 18 for full definition)” and “Please refer to page 58 …” are captured with the full quote and verified on the stated labels.
  5. No drift: When a signpost cites page 18, verification never jumps to 183; only ±2 adjacency is allowed and must be noted if used.
  6. Off-by-one TOC: “Independent auditors’ report … 95” where content actually begins on 94 → Correct? = False, with Evidence explaining the off-by-one and the actual start page.
  7. Coverage completeness: A raw back-check adds any missed signposts (classic, parenthetical, read-more, found-on, panel, index-style) and marks them “Added via coverage back-check.”