What to do about large PDFs?

In my collection, I have a number of large PDFs: more or less complete books, running to 400 or 500 pages.

Obviously, there is a lot of material (word-wise) and many topics in these PDFs. This seems to be having an adverse effect on my collection’s overall AI abilities, such as “see also”. Whenever I choose a topic that is even remotely related to something in one of the PDFs, these PDFs fill up my “see also” list. Since some of them are philosophy books, almost ANYTHING I select and then “see also” pulls them up (philosophy is funny like that).
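To illustrate one way a long document could end up as a "magnet", here is a minimal sketch in Python. It is not how the application's "see also" actually works (that is not described in this thread); it only assumes a simple bag-of-words model and compares a raw term-overlap score with a length-normalized cosine score. The function names and the sample texts are made up for the illustration.

```python
import math
from collections import Counter

def term_counts(text):
    # Naive tokenizer: lowercase and split on whitespace (an assumption for this sketch).
    return Counter(text.lower().split())

def raw_overlap(a, b):
    # Sum of products of shared term counts, with no length normalization.
    ca, cb = term_counts(a), term_counts(b)
    return sum(ca[t] * cb[t] for t in ca if t in cb)

def cosine(a, b):
    # Same dot product, divided by both vector lengths.
    ca, cb = term_counts(a), term_counts(b)
    dot = sum(ca[t] * cb[t] for t in ca if t in cb)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

selected = "silicon laser research"
short_note = "intel silicon laser research summary"
# Hypothetical stand-in for a long book that touches many topics in passing.
long_book = " ".join(["mind truth laser ethics research being silicon knowledge"] * 50)

for name, doc in [("short_note", short_note), ("long_book", long_book)]:
    print(name, raw_overlap(selected, doc), round(cosine(selected, doc), 3))
```

With the raw score, the long text wins simply because it repeats the shared words more often; with the normalized score, the short, focused note ranks higher. Whether anything like this explains the behavior described here is only a guess.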

As a side note, this experience seems, to me, to strengthen Steven Berlin Johnson’s theory of the 500-word sweet spot.
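If one wanted to test the 500-word idea on a large document, a rough sketch would be to split its extracted text into pieces of about that size before importing them. The helper below is hypothetical and assumes the PDF's text has already been extracted to a plain string by some other tool; it splits on word count only and ignores chapter or paragraph boundaries.

```python
def split_into_chunks(text, words_per_chunk=500):
    # Break the text into consecutive pieces of at most words_per_chunk words.
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# Example with placeholder text: 1200 words become chunks of 500, 500 and 200 words.
sample_text = " ".join(["word"] * 1200)
chunks = split_into_chunks(sample_text)
print([len(c.split()) for c in chunks])
```

Splitting at natural section breaks would probably give more meaningful pieces than a fixed word count, but that depends on the document.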

I’m interested in hearing how others who have run into similar experiences have dealt with it…or if it is best to just ignore it.

If you could send me the database (e.g. optimized and zipped via Scripts > Export > Backup Archive…), then I could check if there’s a way to improve this.

I’ve got lots of large PDF files and by and large they don’t support the Johnson theory about a 500 word “sweet spot” because most of them show up with a pretty appropriate ranking when I do a “See Also” request.

But I’ve got two PDF files that are “magnets” when I do See Also.

The most curious one is the 2005 Infiniti G35 owner’s manual. It’s nowhere near as large in word count as many other PDFs in my collection and is pretty much limited to the topics one would expect.

Let’s say I do a search that will pull some documents about lasers. I pick one about Intel’s research with silicon lasers and do “See Also”. Right. My Infiniti owner’s manual is the “most similar”. But the other suggested items make a good reading list for the topics addressed in the summary of Intel research.

Let’s say I do a search on the toxicity of lead in drinking water. The results are on target. I pick one, about lead in drinking water in Washington, DC. I do a “See Also”. My Infiniti owner’s manual is the third most similar. Most of the other suggestions are on target, including some lower ranked PDFs that are bigger than the Infiniti owner’s manual.

The only other really bad suggestion was the Internal Revenue Service’s guidance for filling out my 2005 income taxes. Perhaps my computer heard me complaining about it at tax time and concluded that it’s a toxic document. :)

But I suspect the tax guidance wouldn’t have made the list if the article about lead in drinking water had not been about Washington, DC and had not also been released by a federal agency, with the standard language that accompanies federal agency documents. So it’s not as curious in this context as the Infiniti owner’s manual, and it ranked much lower.

So I will send that Infiniti PDF to Christian and ask him to play with it in his databases.

The documentation says:

This implies that you could place all your PDF “magnets” in the same group, get info on the group, and select “exclude from classification.”

In my experiments with this, I’ve found that excluding an item prevents it from showing up in a query (e.g. my economics texts no longer appear as hits when I search for “inflation”). However, when I select a text that does contain content about inflation and then use “see also,” there they are again: my economics texts are back. Even though they are “excluded,” they show up in the “see also” list.

I need to fiddle with this a bit more. Has anyone else tried this?

Fred, long ago I checked the “exclude from classification” option in the Info panel for that Infiniti owner’s manual.

That did seem to keep it from being a “magnet” for classification, but it still shows up weirdly in See Also suggestions.

However, checking it to prevent classification did nothing to keep it out of search query results. It behaves properly in searches and shows up when it should.