Finding files by Fonts included

Sarsie · December 6, 2020, 12:12pm

I need to identify which files within a set of about 15,000 transcriptions of historic correspondence contain a specific font. The font has been used to differentiate components of a source that were printed (forms, letterheads and the like). Thus the files concerned use more than one font.
Files are in Word saved with the “.doc” suffix.
I am using EasyFind 5.0 on a machine running macOS 10.15.7.

BLUEFROG · December 6, 2020, 3:08pm

Welcome @Sarsie
EasyFind doesn’t search for fonts in documents.
Also Word documents aren’t text-based, so you can’t use a contents search with them.

Sarsie · December 6, 2020, 3:24pm

Thanks for responding.

But EasyFind searches beautifully for content to find specific terms (simple or Boolean) in the corpus of 15,000 files each named “.doc”, so I may have misunderstood your “Also…”

BLUEFROG · December 6, 2020, 3:49pm

Were these .doc files created in older versions of Word?

Sarsie · December 6, 2020, 4:03pm

Yes, starting with Word 3(!!) in 1988, but they have been updated as necessary so they are now worked upon using Word for Mac, version 16.43, saved as “.doc” files.

BLUEFROG · December 6, 2020, 4:27pm

It is possible very old documents could be found. However, newer versions of Word files aren’t text-based so can’t be searched by contents.

Sarsie · December 6, 2020, 4:27pm

Your message prompted me to save a couple of file duplicates as .docx. I see what you mean.

chrillek · December 6, 2020, 5:48pm

Hm. Aren’t these just compressed archives consisting of XML files? Of course one can’t search in the original file, but after decompression…?
Update: Yes, they are:

> unzip -l Arbeitskopie_Marktuebersicht_KI-SOC.docx 
Archive:  XXX.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1445  01-01-1980 00:00   [Content_Types].xml
      590  01-01-1980 00:00   _rels/.rels
    74133  01-01-1980 00:00   word/document.xml
     1862  01-01-1980 00:00   word/_rels/document.xml.rels
     8387  01-01-1980 00:00   word/theme/theme1.xml
     3747  01-01-1980 00:00   word/settings.xml
    14051  01-01-1980 00:00   word/numbering.xml
    33455  01-01-1980 00:00   word/styles.xml
      752  01-01-1980 00:00   word/webSettings.xml
     3683  01-01-1980 00:00   word/fontTable.xml
      749  01-01-1980 00:00   docProps/core.xml
      992  01-01-1980 00:00   docProps/app.xml
---------                     -------
   143846                     12 files

It might be interesting to have a look at word/fontTable.xml, for example

BLUEFROG · December 6, 2020, 6:05pm

Yes, but EasyFind isn’t in the business of unzipping any of the .x files Microsoft Office products produce.

rmschne · December 6, 2020, 6:10pm

Some hope there. See attached screen shot for a simple two line Word DOCX file with three fonts. Here is how the fontTable.xml looks (using BBEdit to look inside the DOCX file directly). See how the “American Typewriter” and “Calibri” are mentioned. Find a way to do a batch decompress and look there. Maybe will work. Dunno. Have to play with it a bit, I guess.

On first try the XML didn’t come through. See attached PDF of the XML.

Attachment: fonttable.pdf (30.6 KB)

BLUEFROG · December 6, 2020, 6:37pm

This is way outside the purview of EasyFind but…

unzip -p -q  neumann.docx word/fontTable.xml | grep -i -o Calibri

would look for instances of Calibri in the unzipped contents. It also doesn’t muddy things up by unzipping to directories. Perhaps not perfect, depending on the font being searched for but something anyhow.

Amended to only consider the fontTable.xml file as the theme.xml may contain the same fonts.

rmschne · December 6, 2020, 6:42pm

Probably the best approach. Didn’t occur to me, but a good one.

BLUEFROG · December 6, 2020, 6:53pm

Thanks. Amended to be a bit more specific.

Sarsie · December 7, 2020, 9:42am

Thanks for this.
Much appreciated, as it is outside the remit of EasyFind and this forum.
I will need to ponder as it seems that I will need to start with a mass conversion of all 15,000 files from .doc to .docx, mass decompress those, and then search for the font.
Getting beyond my technical expertise, but a good steer.

rmschne · December 7, 2020, 9:58am

Find someone in your world who knows about shell scripts in Unix and/or OS X (Apple’s operating system is based on Unix). There are many of these people but you might have to ask around. It should be a short script of a few lines. But it needs to be thought through just a bit more. For example, you could output results to text file(s) and then search those (with EasyFind? I never used it but just guessing it will search text files). With potentially so much information resulting, you’ll likely benefit if the output is directed to structured files (database?). Again, just think it through.

The command that @Bluefrog came up with can be put into a loop to look at each of your files. Might be fast, might take time. Don’t know. But try it out on a few test files and then let it rip.

Gotta say, though, marking text with special fonts to detect things in Word documents seems a little odd. Other ways to have done this, but I understand it’s probably been done for many years by many people who would not recognise this trap.

Sarsie · December 7, 2020, 10:27am

I am about to do just that. I hope I can find someone with the expertise and interest.
Background information
We are working with files some of which were created in 1968. When the output was camera-ready copy (remember that?!!) we had no problem with the 1000-odd files that were published as a “selected correspondence”. When I set up the style sheets that transcribers were [supposed] to follow, I did not create a style which covered this aspect, as printouts produced the effect. The total corpus is now being converted for web publication using some form of automated process of Text Encoding Initiative mark-up I do not pretend to understand, but which loses the distinction. I can apply a style to the parts of the file concerned if I can find files that use it, and this is recognised. Hence my question.

I am actually quite impressed with the rest of the conversion process, which is based on the styles I defined in 1967/68 to be used by research assistants and volunteers in three countries. The style sheets were produced before the TEI got underway. If only I had defined this as a style at that time…!

Sarsie · December 9, 2020, 4:26pm

It turns out to be quite simple to find the fonts used within a document, and EasyFind is a key part of the solution.

Building on the previous suggestions here, a colleague used a small test set and converted the Word .doc files individually to .html using “save as… webpage (.htm)”. An EasyFind search for “font-family:Times” then found the items it should have, and not those that did not contain the font. I then found an application that batch converted.
Summarising:

1. Batch converted .doc to HTML using Doxillion.
2. Searched the output folder with EasyFind using the search term “font-family:Times”.
3. The list generated gave the target files.
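
For anyone who prefers the command line, step 2 can be approximated with grep. This is only a sketch, assuming the converted files carry inline CSS of the form font-family:Times as in the test set described above; the function name list_html_with_font and the folder name are placeholders.

```shell
#!/bin/sh
# list_html_with_font DIR [PATTERN]
# Print each .htm/.html file under DIR that contains PATTERN
# (default: font-family:Times). Hypothetical helper.
list_html_with_font() {
    dir=$1
    pattern=${2:-font-family:Times}
    # -l prints only the names of matching files; -F treats the pattern as a
    # literal string; -i ignores case.
    find "$dir" -type f \( -name '*.htm' -o -name '*.html' \) \
        -exec grep -l -i -F -- "$pattern" {} +
}

# Example (placeholder folder name):
# list_html_with_font ./html_output > target_files.txt
```

It is worth checking a few converted files by eye first, since a different converter may write the font name in another form (for example with quotes or a space after the colon), in which case the pattern would need adjusting.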