I need to identify which files within a set of about 15,000 transcriptions of historic correspondence contain a specific font. The font has been used to differentiate components of a source that were printed (forms, letterheads and the like). Thus the files concerned use more than one font.
Files are in Word saved with the “.doc” suffix.
I am using EasyFind 5.0 on a machine running macOS 10.15.7.
Welcome @Sarsie
EasyFind doesn’t search for fonts in documents.
Also Word documents aren’t text-based, so you can’t use a contents search with them.
Thanks for responding.
But EasyFind searches beautifully for content to find specific terms (simple or Boolean) in the corpus of 15,000 files each named “.doc”, so I may have misunderstood your “Also…”
Were these .doc files created in older versions of Word?
Yes, starting with Word 3(!!) in 1988, but they have been updated as necessary so they are now worked upon using Word for Mac, version 16.43, saved as “.doc” files.
It is possible very old documents could be found. However, newer versions of Word files aren’t text-based so can’t be searched by contents.
Your message prompted me to save a couple of file duplicates as .docx. I see what you mean.
Hm. Aren’t these just compressed archives consisting of XML files? Of course one can’t search in the original file, but after decompression…?
Update: Yes, they are:
> unzip -l Arbeitskopie_Marktuebersicht_KI-SOC.docx
Archive: XXX.docx
Length Date Time Name
--------- ---------- ----- ----
1445 01-01-1980 00:00 [Content_Types].xml
590 01-01-1980 00:00 _rels/.rels
74133 01-01-1980 00:00 word/document.xml
1862 01-01-1980 00:00 word/_rels/document.xml.rels
8387 01-01-1980 00:00 word/theme/theme1.xml
3747 01-01-1980 00:00 word/settings.xml
14051 01-01-1980 00:00 word/numbering.xml
33455 01-01-1980 00:00 word/styles.xml
752 01-01-1980 00:00 word/webSettings.xml
3683 01-01-1980 00:00 word/fontTable.xml
749 01-01-1980 00:00 docProps/core.xml
992 01-01-1980 00:00 docProps/app.xml
--------- -------
143846 12 files
It might be interesting to have a look at word/fontTable.xml, for example.
Yes, but EasyFind isn’t in the business of unzipping any of the .x files Microsoft Office products produce.
Some hope there. See the attached screenshot for a simple two-line Word DOCX file with three fonts. Here is how the fontTable.xml looks (using BBEdit to look inside the DOCX file directly). See how “American Typewriter” and “Calibri” are mentioned. Find a way to do a batch decompress and look there. Maybe it will work. Dunno. Have to play with it a bit, I guess.
On first try the XML didn’t come through. See the attached PDF of the XML.
Attachment: fonttable.pdf (30.6 KB)
This is way outside the purview of EasyFind but…
unzip -p -q neumann.docx word/fontTable.xml | grep -i -o Calibri
would look for instances of Calibri in the unzipped contents. It also doesn’t muddy things up by unzipping to directories. Perhaps not perfect, depending on the font being searched for but something anyhow.
Amended to only consider the fontTable.xml file, as the theme.xml may contain the same fonts.
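A slightly stricter variant of the one-liner above, offered only as a sketch. It assumes the font table stores each font as a `w:name="…"` attribute, and it matches that whole attribute so a search for “Times” does not also hit “Times New Roman”. The function names `declares_font` and `docx_has_font` are made up for illustration. Note that the font table may list fonts a document declares without applying them to any text, so a hit is a candidate rather than proof.

```shell
#!/bin/sh
# Succeeds if the fontTable XML read from stdin declares the font named in $1.
# The whole w:name attribute is matched literally (-F), so "Times" does not
# match "Times New Roman".
declares_font() {
    grep -q -F "w:name=\"$1\""
}

# Succeeds if the .docx file $1 lists font $2 in word/fontTable.xml.
# unzip -p writes the archive member to the pipe without creating any files.
docx_has_font() {
    unzip -p -q "$1" word/fontTable.xml 2>/dev/null | declares_font "$2"
}
```

For example, `docx_has_font neumann.docx "Calibri" && echo "declares Calibri"` prints the message only when the font is listed.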
Probably the best approach. Didn’t occur to me, but a good one.
Thanks. Amended to be a bit more specific.
Thanks for this.
Much appreciated as it is outside the remit of EasyFind and this forum
I will need to ponder, as it seems that I will need to start with a mass conversion of all 15,000 files from .doc to .docx, mass decompress those, and then search for the font.
Getting beyond my technical expertise, but a good steer.
Find someone in your world who knows about shell scripts in Unix and/or macOS (Apple’s operating system is based on Unix). There are many of these people but you might have to ask around. It should be a short script of a few lines. But it needs to be thought through just a bit more. For example, you could output results to text file(s) and then search those (with EasyFind? I never used it but just guessing it will search text files). With potentially so much information resulting, you’ll likely benefit if the output is directed to structured files (database?). Again, just think it through.
The command that @Bluefrog came up with can be put into a loop to look at each of your files. Might be fast, might take time. Don’t know. But try it out on a few test files and then let it rip.
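The loop described here could look roughly like the following. This is a minimal sketch, not a tested solution for the real corpus: it assumes the files have already been converted to .docx, that `unzip` is available, and that the font name appears in a `w:name="…"` attribute of word/fontTable.xml. The function name `list_docx_with_font` is hypothetical.

```shell
#!/bin/sh
# list_docx_with_font DIR FONT
# Prints the path of every .docx file under DIR whose font table declares FONT.
list_docx_with_font() {
    find "$1" -type f -name '*.docx' | while IFS= read -r file; do
        # Stream only word/fontTable.xml out of the archive; nothing is
        # unpacked to disk. Files that are not valid archives are skipped.
        if unzip -p -q "$file" word/fontTable.xml 2>/dev/null |
            grep -q -F "w:name=\"$2\""; then
            printf '%s\n' "$file"
        fi
    done
}
```

Redirecting the output, for example `list_docx_with_font ~/Letters "Times" > matches.txt` (the folder name is made up), gives a plain text list that can be searched or worked through later.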
Gotta say, though, marking text with special fonts to detect things in Word documents seems a little odd. Other ways to have done this, but I understand it’s probably been done for many years by many people who would not recognise this trap.
I am about to do just that. I hope I can find someone with the expertise and interest.
Background information
We are working with files some of which were created in 1968. When the output was camera-ready copy (remember that?!!) we had no problem with the 1000-odd files that were published as a “selected correspondence”. When I set up the style sheets that transcribers were [supposed] to follow, I did not create a style which covered this aspect, as printouts produced the effect. The total corpus is now being converted for web publication using some form of automated process of Text Encoding Initiative mark-up I do not pretend to understand, but which loses the distinction. I can apply a style to the parts of the file concerned if I can find files that use it, and this is recognised. Hence my question.
I am actually quite impressed with the rest of the conversion process, which is based on the styles I defined in 1967/68 to be used by research assistants and volunteers in three countries. The style sheets were produced before the TEI got underway. If only I had defined this as a style at that time…!
It turns out to be quite simple to find fonts used within a document, and EasyFind is a key part of the solution.
Building on the previous suggestions here, a colleague took a small test set and converted the Word .doc files individually to .html using “Save As… Web Page (.htm)”. An EasyFind search for “font-family:Times” then found the items it should have, and not those that did not contain the font. I then found an application that batch converted.
Summarising:
1. Batch converted .doc to HTML using Doxillion.
2. Searched the output folder with EasyFind using the search term “font-family:Times”. The list generated gave the target files.
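For anyone who prefers the command line, the second step can be approximated with `grep`. This is only a sketch of an equivalent search, not what was actually run here; the function name `list_html_with_font` is made up, and it assumes the converted HTML writes the declaration exactly as `font-family:` followed by the font name, as in the search term above.

```shell
#!/bin/sh
# list_html_with_font DIR FONT
# Prints the path of every file under DIR that contains "font-family:FONT".
# -r searches the folder recursively, -l prints file names only,
# -F treats the search term as literal text rather than a pattern.
list_html_with_font() {
    grep -r -l -F -- "font-family:$2" "$1"
}
```

If the converter writes the declaration differently (extra spaces or quotes around the name), the search term would need adjusting to match.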