DT3 check PDF/A scan

Hi there,

I scan my documents with a Brother ADS 2800W, which can also generate PDF/A.
Can I see somewhere in DT3 whether a file really is PDF/A?

If I use the OCR function in DT3, does that change the format (PDF/A should remain PDF/A)?

I cannot find anything in DEVONthink or in macOS, e.g. “Info” in Finder, that can distinguish between PDF and PDF/A. In Preview, PDF/A documents are not editable (as intended by the PDF/A standard), but I don’t know where that is “enforced”.

Internet searching came up with tools available from Adobe and others. You could probably write your own tool if you know the content spec of PDF/A and PDF (probably published somewhere).

Re the OCR question, I ran an experiment. I created a PDF and a PDF/A of a 1-page document using a Brother ADS-1700W and imported both into DEVONthink. The PDF/A was ~19 times bigger than the PDF. Both files could be OCR-ed with DEVONthink’s OCR tool into new files. Both files with OCR info (PDF+Text, as labeled in DEVONthink) of course grew in size.

Suggest you experiment yourself.

There could be some way in DEVONthink that I don’t know about. Probably some sort of code could be written (AppleScript, Python, Perl, … whatever) and integrated with DEVONthink, using the specification for PDF vs. PDF/A. But I did not pursue that. Others can comment on both.
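
As a rough sketch of what such a check could look like without any extra tools: PDF/A files declare themselves with a `pdfaid:part` property in their XMP metadata, so one can search the raw bytes of the file for that string. This assumes the metadata is stored uncompressed inside the file, which is common but not guaranteed, and it only detects the claim, not whether the file is actually valid PDF/A. The function name `raw_pdfa_marker` is made up for this example.

```shell
# raw_pdfa_marker: hypothetical helper. Exit status is 0 if the raw bytes of
# the file given as $1 contain the XMP identifier "pdfaid:part", non-zero
# otherwise. Assumes the XMP metadata is stored uncompressed in the file.
raw_pdfa_marker() {
    # LC_ALL=C and -a make grep treat the binary PDF as plain bytes;
    # -q suppresses output so that only the exit status matters.
    LC_ALL=C grep -a -q 'pdfaid:part' -- "$1"
}
```

It could be used as `raw_pdfa_marker scan.pdf && echo "claims PDF/A"`, where `scan.pdf` is a placeholder file name.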

One thing is funny.
I have created 2 profiles in the Brother setup, one that creates PDF and another that creates PDF/A,
and I use each of them with a shortcut.
After scanning a page with both profiles, the PDF/A is slightly smaller.

I suppose that’s because the “A” stands for archive, which in particular means that fonts and graphics have to be included in the file. More on PDF/A tools here

I have no idea why you see different file sizes. But as that is off topic for your question, I’m not sure of its relevance. Maybe it matters, but I don’t know. You can pursue it. Here are my settings.

Yes, since fonts and other stuff are included, it is only natural that the files get bigger. Yes, my understanding is that PDF/A is meant to make files as future-proof as possible in terms of presentation and security. I never use it.

The background to my question is as follows:
I am currently converting my office to paperless.
I am scanning all of my filed papers.

At the moment I am creating plain PDF files, and by chance I came across the PDF/A format.
To what extent do I, as a private person, need PDF/A at all, and does it make sense?

There isn’t a need for PDF/A unless there’s a requirement for it, e.g., you are working in litigation or with a governmental agency that defines the standard it accepts.

That’s for you to decide. It might not depend on the fact that you’re a “private person” but on the kind of documents and what usage you make of them. Then the legislation of the region you’re living in or possibly sending these documents to comes into play.
In any case it has nothing to do with DT. There are websites dedicated to PDF/A that should be able to shed more light on your question(s).

Well, somehow I had hoped that with DT I could recognize whether a PDF is a normal PDF or a PDF/A.
In the Finder, both look the same, and I cannot use the information in Finder to tell which is which.
Therefore my hope was that DT could bring me clarity.

But I don’t think the discussion about PDF/A belongs here.
Nevertheless, many thanks for the many comments.

That’s what I have been looking for for years. Does anyone have a solution or hint?

Background: I receive many PDF/A files from other people (unfortunately, these are almost never marked as PDF/A or non-PDF/A). I need to annotate these files (or a copy of them) in DT. If they are PDF/A, then after the very first annotation the text isn’t “readable” anymore (all characters become “?”). Until now I have used an Automator workflow to “flatten” these files before importing them into DEVONthink, after which annotating was possible. Another way was to save them with Adobe Acrobat. This isn’t really quick work.

It would be a great simplification if I could recognize in DT that it is a PDF/A and that I didn’t have to first check with Adobe Acrobat whether it is one.

I already tried many command line tools to extract infos from the files, but the format (PDF/X, PDF/A, …) couldn’t be read.

Thanks!

I already tried many command line tools to extract infos from the files, but the format (PDF/X, PDF/A, …) couldn’t be read.

Doesn’t that point out the difficulty of the request you’re making? :thinking: :slight_smile:

I generated some PDF/A files using ABBYY FineReader and used the command-line tool pdfinfo to look at the result.

First, a plain old PDF:

jeremy@Typhon ~ % pdfinfo -meta pdfnoa.pdf
Creator:        ABBYY FineReader PDF For Mac
Producer:       ABBYY FineReader PDF For Mac
CreationDate:   Tue Nov 16 21:08:44 2021
ModDate:        Tue Nov 16 21:08:44 2021
Tagged:         yes
Form:           none
Pages:          42
Encrypted:      no
Page size:      515.5 x 696.5 pts (rotated 0 degrees)
File size:      4131028 bytes
Optimized:      no
PDF version:    1.5

Next, a file encoded with one of the versions of PDF/A:

jeremy@Typhon ~ % pdfinfo -meta pdf3.pdf  
Creator:        ABBYY FineReader PDF For Mac
Producer:       ABBYY FineReader PDF For Mac
CreationDate:   Tue Nov 16 21:03:57 2021
ModDate:        Tue Nov 16 21:03:57 2021
Tagged:         yes
Form:           none
Pages:          42
Encrypted:      no
Page size:      515.5 x 696.5 pts (rotated 0 degrees)
File size:      4288797 bytes
Optimized:      no
PDF version:    1.5
Metadata:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about=""><dc:format>application/pdf</dc:format><dc:title><rdf:Alt><rdf:li xml:lang="x-default">Unknown Title</rdf:li></rdf:Alt></dc:title></rdf:Description><rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about=""><pdf:Producer>ABBYY FineReader PDF For Mac</pdf:Producer><pdf:Keywords></pdf:Keywords></rdf:Description><rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about=""><xmp:CreatorTool>ABBYY FineReader PDF For Mac</xmp:CreatorTool><xmp:CreateDate>2021-11-16T21:03:57Z</xmp:CreateDate><xmp:ModifyDate>2021-11-16T21:03:57Z</xmp:ModifyDate></rdf:Description><rdf:Description xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/" rdf:about=""><xmpMM:DocumentID>uuid:00002A8F-0281-78AE-1219-662C1A4CEFBB</xmpMM:DocumentID></rdf:Description><rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="" pdfaid:part="3" pdfaid:conformance="U"></rdf:Description><rdf:Description xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/" xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#" xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#" xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/"><pdfaExtension:schemas><rdf:Bag><rdf:li rdf:parseType="Resource"><pdfaSchema:schema>PDF/UA Universal Accessibility Schema</pdfaSchema:schema><pdfaSchema:namespaceURI>http://www.aiim.org/pdfua/ns/id/</pdfaSchema:namespaceURI><pdfaSchema:prefix>pdfuaid</pdfaSchema:prefix><pdfaSchema:property><rdf:Seq><rdf:li rdf:parseType="Resource"><pdfaProperty:name>part</pdfaProperty:name><pdfaProperty:valueType>Integer</pdfaProperty:valueType><pdfaProperty:category>internal</pdfaProperty:category><pdfaProperty:description>Indicates, which part of ISO 14289 standard is 
followed</pdfaProperty:description></rdf:li></rdf:Seq></pdfaSchema:property></rdf:li></rdf:Bag></pdfaExtension:schemas><pdfuaid:part>1</pdfuaid:part></rdf:Description></rdf:RDF></x:xmpmeta><?xpacket end='w'?>

Detecting the PDF/A claim means scanning the metadata for the pdfaid entries (pdfaid:part and pdfaid:conformance in the output above; the pdfaSchema tags only describe extension schemas). It is probably easiest to install xpdf (using brew, if you have that) and call pdfinfo from an AppleScript. Validating the claim is much harder, but @tjur doesn’t need to do that.
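
A minimal sketch of that detection step, assuming the metadata looks like the pdfinfo output above (`pdfaid:part="3"` and `pdfaid:conformance="U"` as attributes). The helper name `pdfa_claim` is made up for this example. It also accepts the element form (`<pdfaid:part>3</pdfaid:part>`), on the assumption that other producers may write it that way; nothing in this thread confirms that.

```shell
# pdfa_claim: hypothetical helper. Reads XMP metadata on stdin and prints the
# claimed PDF/A level (e.g. "PDF/A-3U"). Prints nothing and returns 1 if no
# pdfaid:part entry is found. It reports the claim only; it validates nothing.
pdfa_claim() {
    meta=$(cat)
    # Match pdfaid:part="3" (attribute form) or pdfaid:part>3 (element form),
    # then keep only the digits.
    part=$(printf '%s\n' "$meta" \
        | grep -Eo "pdfaid:part[=>][\"']?[0-9]+" | head -n 1 | tr -cd '0-9')
    # Same for the conformance letter: strip everything up to the last non-letter.
    conf=$(printf '%s\n' "$meta" \
        | grep -Eo "pdfaid:conformance[=>][\"']?[A-Za-z]" | head -n 1 \
        | sed 's/.*[^A-Za-z]//')
    if [ -n "$part" ]; then
        printf 'PDF/A-%s%s\n' "$part" "$conf"
    else
        return 1
    fi
}
```

It would be fed from pdfinfo, for example `pdfinfo -meta pdf3.pdf | pdfa_claim`.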

Great! Thanks!

With the “grep” command it is possible to filter for “pdfa”…

I created an Automator action (I don’t know how to use the “grep” command in that shell script within Automator, which is why it is commented out with “#”; in the Terminal it works):

Unfortunately, I am technically not able to write a script for DEVONthink, for example one that sets a tag / mark or renames a file if it is PDF/A. It can certainly be done with an if condition. Maybe someone with programming skills who is interested has some time and energy…

I don’t know Automator, but the grep part looked kind of overquoted to me. In the shell, I’d write
grep pdfa
with no quotes at all.

You are right. I assumed that there was something wrong with the syntax (because I received an error as a result), and to solve that, I tried quotes. In fact, an error only appears if the text you are looking for does not occur. Just counting is also enough:

/usr/local/bin/pdfinfo -meta "$1" | /usr/bin/grep -c -i pdfa

-i = ignore case
-c = count

… adding “echo”

/usr/local/bin/pdfinfo -meta "$1" | /usr/bin/grep -c -i pdfa
echo

the error is gone.

No. grep will still report an error to the shell if the string is not found. echo only masks it, because echo itself runs without error. The proper way to deal with this is to check the shell’s exit status variable ($?). Return 0 if it is non-zero (no match), 1 otherwise.

I will try to do that!

Something like

grep -ci pdfa; if (( $? )); then echo 0; else echo 1; fi

However, here (Monterey, zsh), grep -ci brmplf prints “0” when nothing matches, so the code results in two consecutive lines with 0 printed on them. I suggest that you leave out the -ci from your grep and simply check for error status, something like

pdfinfo ... | grep pdfa > /dev/null 2>&1; if (( $? )); then echo 0; else echo 1; fi

That should simply print 1 if pdfa was found and 0 if not, no output from grep at all.
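
The same check can also be sketched with grep’s `-q` option, which suppresses all output and leaves only the exit status, so grep can serve directly as the `if` condition and no redirect is needed. The function name `pdfa_flag` is made up for this example; it reads the pdfinfo output on stdin.

```shell
# pdfa_flag: hypothetical helper. Prints 1 if stdin contains "pdfa" in any
# letter case, otherwise 0. The function itself always ends with status 0,
# so a caller such as Automator sees no error when nothing matches.
pdfa_flag() {
    if grep -qi pdfa; then
        echo 1
    else
        echo 0
    fi
}
```

In the Automator shell script it would be called as `/usr/local/bin/pdfinfo -meta "$1" | pdfa_flag`.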

Alternatively, a JavaScript script could do

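const app = Application.currentApplication();
app.includeStandardAdditions = true;
// 'filename' is a placeholder for the quoted path of the PDF to check.
// doShellScript runs with a minimal PATH, so a full path to pdfinfo may be needed.
const result = app.doShellScript(`pdfinfo -meta 'filename'`);
// Case-insensitive match, like grep -i above.
const isPDFA = /pdfa/i.test(result);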

and then do whatever it wants if isPDFA is true, for example tell the program you use for that to flatten the file. The script can be part of a smart rule that handles PDFs with a word count of 0.