I have a large volume (1,000+) of newspaper scans which are usually in pairs, one JPG with the article, another JPG with the masthead for that issue.
Here is an example of the naming conventions:
Miami_Tribune_1925_08_31_1.jpg
Miami_Tribune_1925_08_31_4.jpg
The_Tampa_Tribune_1923_03_25_1.jpg
The_Tampa_Tribune_1923_03_25_34.jpg
The_Tampa_Times_1925_09_04_1.jpg
The_Tampa_Times_1925_09_04_20.jpg
Usually I would simply (and manually) combine each pair of JPGs into a single PDF, crop the article from one page and then crop the masthead from the other page, and save the new PDF and be done.
But with this large volume of files, I need to be able to roll them all up into one big PDF file, so that I can do a ton of the editing within the larger file, before I split each edited pair back into its own two-page PDF.
Here’s where I want to pick your brain: the dates embedded in the original JPG file names (e.g. 1925_08_31) are very important, and I want to keep each date with the PDF page that is generated when I add that JPG to the big PDF. Doing so will help confirm that I am working on the right pair of pages. The date could be stored in the PDF metadata, or each page could get an automatically generated bookmark containing the extracted date, for example.
But more importantly, if the date from the original filename is embedded within each page of the combined PDF, I’d like to then use that metadata to name the extracted two-page PDF once I am done editing.
So that ideal workflow would be:
(1) I combine all six JPG files above into a single PDF file.
(2) The date portion from each JPG filename is somehow associated with each PDF page. Each pair of related PDF pages in this example would have the same date.
(3) I go through and manually edit the full PDF, cropping etc. as needed.
(4) When I am finished, then I run an automated process to a) extract each pair of PDF pages as a single standalone PDF, b) named with the date originally extracted from the JPG filenames.
In this example, at the end of my workflow, there would be three PDF files, each with two edited pages, automatically named as follows:
Miami_Tribune_1925_08_31.pdf
The_Tampa_Tribune_1923_03_25.pdf
The_Tampa_Times_1925_09_04.pdf
How much of this, if any, can PDF-XChange or PDF-Tools do for me? I am also open to hearing about other tools that may be able to handle one or more of these steps.
Thanks!
Embed JPG filename substring so that it is referenced later on PDF extraction
-
bssmith
- User
- Posts: 3
- Joined: Sat Nov 22, 2025 12:55 am
-
Vladimir G - Tracker Dev
- User
- Posts: 92
- Joined: Thu Nov 30, 2017 1:24 pm
Re: Embed JPG filename substring so that it is referenced later on PDF extraction
Hello bssmith,
Unfortunately, PDF-Tools does not currently support any automatic grouping of input files. This means that it cannot use filename similarity to determine which files belong together.
However, if some preprocessing is acceptable for your workflow, I can suggest organizing your images into subfolders, for example:
Miami_Tribune_1925_08_31
├── Miami_Tribune_1925_08_31_1.jpg
└── Miami_Tribune_1925_08_31_4.jpg
The_Tampa_Tribune_1923_03_25
├── The_Tampa_Tribune_1923_03_25_1.jpg
└── The_Tampa_Tribune_1923_03_25_34.jpg
The_Tampa_Times_1925_09_04
├── The_Tampa_Times_1925_09_04_1.jpg
└── The_Tampa_Times_1925_09_04_20.jpg
This can be done manually or with a simple preprocessing script.
(Not a ready-to-use script, but rather an example illustrating the general idea.)
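Before any files are moved, the grouping can be previewed without touching the disk. The following Python sketch is only an illustration under assumptions: the names `NAME_RE`, `group_key` and `plan_folders` are made up for this example, and it assumes every scan is named `<title>_<YYYY>_<MM>_<DD>_<page>.jpg`, as in the examples above.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <title>_<YYYY>_<MM>_<DD>_<page>.jpg
NAME_RE = re.compile(
    r"^(?P<key>.+_(?P<date>\d{4}_\d{2}_\d{2}))_(?P<page>\d+)\.jpe?g$",
    re.IGNORECASE,
)

def group_key(filename):
    """Return (folder_name, date, page) or None if the name does not match."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return m.group("key"), m.group("date"), int(m.group("page"))

def plan_folders(filenames):
    """Map each target folder name to its files, sorted by page number."""
    groups = defaultdict(list)
    for name in filenames:
        parsed = group_key(name)
        if parsed is None:
            continue  # names that do not fit the pattern are left alone
        key, _date, page = parsed
        groups[key].append((page, name))
    return {key: [n for _p, n in sorted(items)] for key, items in groups.items()}
```

Calling `plan_folders` on a directory listing shows which folder each scan would land in, and any name that does not fit the assumed pattern is simply skipped so it can be reviewed by hand.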
After that, you can use the Split/Merge PDFs tool with the following settings:
Method: All pages to one document
Root Bookmark: Add bookmarks with folder structure
This will produce a single document with a bookmark structure similar to this:
[Image: bookmark structure of the combined document (attachment not available)]
Once you finish manually editing the combined document, you can use Split/Merge PDFs again to separate it back into individual PDFs using:
Method: Split by top bookmarks
Name for generated document(s): %[Bookmark]
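To confirm that nothing was lost after the split, the expected output names can be derived from the original JPG names and compared with what was actually produced. This is a minimal Python sketch, assuming the `%[Bookmark]` macro yields the folder name unchanged (worth confirming on a small test set); `expected_pdf_names` and `missing_outputs` are hypothetical helpers, not part of PDF-Tools.

```python
import re

def expected_pdf_names(jpg_names):
    """Return the sorted PDF names expected after splitting by top bookmark."""
    folders = set()
    for name in jpg_names:
        stem = name.rsplit(".", 1)[0]              # drop the extension
        folders.add(re.sub(r"_\d+$", "", stem))    # drop the trailing page number
    return sorted(f"{folder}.pdf" for folder in folders)

def missing_outputs(jpg_names, produced_pdfs):
    """Return expected PDF names that are absent from the produced list."""
    return sorted(set(expected_pdf_names(jpg_names)) - set(produced_pdfs))
```

An empty list from `missing_outputs` means every pair of scans has a matching PDF.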
If you would like, I can provide an example script that demonstrates the overall approach.
Best regards,
Vladimir
Software Developer
PDF-XChange Co. LTD
bssmith
That is a solid approach, and simpler/cleaner than what I feared would be needed.
Yes, I would love to see an example script that demonstrates the overall approach.
Many thanks!
Vladimir G - Tracker Dev
Hello bssmith,
This example PowerShell script may not cover special or uncommon cases that were not described earlier, but it should work correctly for the filename structure shown in your example.
It is recommended to first test it on a small set of copied files, and even if everything works correctly, make a full backup before running it on your actual dataset.
The script can also be safely run multiple times in the same folder, which is useful if new files are added later.
```powershell
# Group JPG files into folders based on filename (example script)
Get-ChildItem *.jpg -File | ForEach-Object {
    # Get the filename without extension
    $name = $_.BaseName
    # Remove the final "_NN" (page number) segment
    $folder = $name -replace '_\d+$', ''
    # Create the folder if it does not exist
    if (!(Test-Path -LiteralPath $folder)) {
        New-Item -ItemType Directory -Path $folder | Out-Null
    }
    # Move the file into that folder (full path avoids ambiguity)
    Move-Item -LiteralPath $_.FullName -Destination $folder
}
```
How to run it:
- Put all JPG files in a single folder.
- In that folder, Shift + right-click an empty area.
- Choose “Open PowerShell window here” (or “Open in Terminal”).
- Paste the script into the window and press Enter.
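Because the scans usually, but not always, come in pairs, it may be worth checking the folders once the script has run. The Python sketch below is an illustration only: `audit_folders` is a hypothetical helper, and the expected count of two images per folder is an assumption taken from the workflow described in the first post.

```python
from pathlib import Path

def audit_folders(root, expected=2):
    """Return {folder_name: jpg_count} for subfolders whose JPG count differs from `expected`."""
    odd = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # ignore loose files in the root folder
        count = sum(
            1
            for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in {".jpg", ".jpeg"}
        )
        if count != expected:
            odd[folder.name] = count
    return odd
```

Running `audit_folders(".")` from the folder that holds the subfolders lists every issue that has a missing or extra scan, so those can be fixed before merging.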
Best regards,
Vladimir
bssmith
Thanks so much, Vladimir!
-
Sean - PDF-XChange
- Site Admin
- Posts: 769
- Joined: Wed Sep 14, 2016 5:42 pm
Sean Godley
Technical Writer
PDF-XChange Co LTD
Sales: +1 (250) 324-1621
Fax: +1 (250) 324-1623