Indexing vs. Importing Revisited

I’ve been looking through old forum entries on Indexing but have not found answers to the following question:

If I choose to index a large folder of PDFs in my “documents” folder in my DTPO2 dbase, will DT do the following:

  1. ‘update’ its indexing when I add/remove files from that folder?
  2. search the entire contents of those files if they are readable PDFs?
  3. search or recognize annotations to those files, e.g. added in Skim or Adobe Acrobat?

The scenario is this: I use Sente, the bibliography software, to organize and store my PDFs in one folder which is available in the Finder, but accessible through Sente. (It’s a nice setup, ideal in my pre-DTPO days.) However, this folder changes every time I add a new article. Ideally, I’d love to continue to use Sente to ‘house’ all my PDFs, but be able to fully search and index them through DT, with DT automatically updating its index as the Sente folder changes.

Thanks for your help!

  1. Yes, after using File > Synchronize for the indexed folder/group.

  2. Yes.

  3. Annotations are recognized and displayed, but they are not yet indexed and therefore are not searchable.
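
For anyone who wants to check what has changed in the external folder before running File > Synchronize, here is a minimal Python sketch. It has nothing to do with how DEVONthink works internally; it only compares two listings of PDF file names. The function names `snapshot` and `diff_snapshots` are made up for this illustration, and it assumes all the PDFs sit directly in one folder (no subfolders).

```python
from pathlib import Path


def snapshot(folder):
    """Return the set of PDF file names directly inside `folder` (non-recursive)."""
    return {
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    }


def diff_snapshots(indexed, current):
    """Compare an earlier listing with the current one.

    Returns two sorted lists: names added since the earlier listing,
    and names that have been removed.
    """
    added = sorted(current - indexed)
    removed = sorted(indexed - current)
    return added, removed
```

Take one `snapshot` right after synchronizing, take another later, and pass both to `diff_snapshots`. If either list is non-empty, the indexed group is out of date and it is time to synchronize again.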

Hi-

I’ve been working with this workflow, but I wanted to see if I could update it even more. So, right now, I have a folder with all my Sente PDFs in the Finder, which is organized and labeled by Sente. I plan to synchronize and index that folder with a DTP2 database.

(1) Is there a way for DTP2 to automatically OCR any non-OCR’d files in indexed folders? I.e., if there were files attached to Sente that had not already been OCR’d, could I OCR them with DTP2 without importing them into DT’s database (leaving them only as indexed files)? That way I could have all my secondary source PDFs in one place while also being able to search them in DTP2.

I assume this is the best way to handle the Sente-DT divide when it comes to using files that you will eventually cite.

I’ll leave it to heavy-duty scripters to come up with procedures to trigger automatic OCR of image-only PDFs in Indexed folders.

Here’s an approach to create a smart group that will list all the PDFs that have been Index-captured to your database(s), and that contain fewer than ten words of text.

Open the full Search window (Tools > Search). Set it to search ‘Databases’ (so that all open databases will be searched). Click on the ‘Advanced’ button and enter the criteria as shown in the screenshot.

[Screenshot: Criteria in Advanced button.jpg]

Hit Return to invoke the search.

Why did I enter a number for the Word count? Because some PDFs, e.g., those from sites that provide PDFs of newspaper clippings, contain an image of an old newspaper article that has not been OCR’d, but do include a little searchable text giving the source of the article. If you work with such PDFs, experiment to find a word count that still displays them in the smart group.
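
To make the word-count reasoning concrete, here is a small Python sketch of the same filter. It does not read PDFs or talk to DEVONthink. It assumes you already have each document’s extracted text in a dictionary, and the name `ocr_candidates` and the sample file names are invented for the example.

```python
def ocr_candidates(texts, max_words=10):
    """Return the names of documents with fewer than `max_words` words of text.

    `texts` maps a document name to whatever text could be extracted from it.
    An image-only PDF yields an empty string; a clipping with only a short
    source line yields a handful of words.
    """
    return sorted(
        name for name, text in texts.items() if len(text.split()) < max_words
    )


# Hypothetical sample data, for illustration only.
sample = {
    "scan.pdf": "",                                           # image only
    "clipping.pdf": "Source: The Daily Example, 3 May 1921",  # 7 words
    "article.pdf": "word " * 500,                             # fully searchable
}
```

With the threshold at ten, `ocr_candidates(sample)` picks up both the bare scan and the clipping; with `max_words=5` the clipping is missed. That is the reason to experiment with the number rather than searching for a word count of zero.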

Click on the ‘+’ button to the right of the query field to save the search as a smart group, and name it, e.g., ‘Indexed PDF OCR Candidates’. Because it searches all open databases, this smart group will be saved in the left Sidebar (which displays the Global Inbox).

Remember to reset the ‘Advanced’ button in Search when finished.

It’s probably best not to batch select all the PDFs listed in the smart group, as that might result in moving items from the groups in which they are currently filed.

If the option in Preferences > OCR to move the original PDF to the Trash is checked, the original will be deleted. You may find it useful to select a PDF and press ‘Command-R’ (the Reveal command) to see it in the group where it is filed. From that location, select the PDF and choose ‘Data > Convert > to searchable PDF’.

However, as the searchable PDF is stored within the database, and isn’t currently indexed, there’s an option to move it to the external folder that had been Indexed. Select the PDF, Control-click and choose the contextual menu option to move it to the external folder.

Finally, select the group corresponding to that external folder and choose ‘File > Synchronize’. Now the searchable PDF is Indexed, and is among the items listed within the Indexed group.