Exclude from classification, exclude from the index - how?

Maak · March 26, 2007, 11:07am

In the info panel, there is a button to “exclude from classification”. I have several large PDF+text files in my database which I want to read, but I want to exclude them from indexing, because I have them in the database a second time as individual pages (this gives better search results). How does this “exclude” button work; what does it actually do? Is there any other way to exclude a (PDF) file from the index?

I am asking because although I checked this option, the PDFs still show up in my “see also”-list.
Thank you,
Mark

Bill_DeVille · March 26, 2007, 4:37pm

Hi, Mark. As you discovered, excluding a document from Classification doesn’t exclude it from the See Also list.

Exclusion from the See Also list would have advantages in certain cases and that’s something the developers will consider.

Thanks for the comment.

Maak · May 28, 2007, 11:46am

Thank you for your answer, Bill, but since the next version of DTPO may take a while, is there any workaround for the present situation?

Since I started using my new Fujitsu scanner, I have generated a few hundred PDF files which I need to refer to now and then, but they get in the way when I am searching for something. I have lots of RTF notes in the db which contain my personal thoughts and notes about a subject or a particular paper - usually I keep the original paper with my comments in the db. This is really useful when I cite something and need to look up the context of a particular quotation, but “see also” results have become so fuzzy lately that the function is not really useful.

I even thought about duplicating the DB and deleting all the pdf files in there, but then I’d have to switch databases constantly when I want to perform a “see also”. Is there any other way to keep my pdf articles out of the index?

thank you,
Mark

Bill_DeVille · May 28, 2007, 4:29pm

No current workaround except by splitting databases.

That’s one of the reasons I use topical databases.

Maak · May 29, 2007, 2:59am

Thanks for your answer, Bill, but my database is a topical database. It is about my dissertation topic. Up to now, the documents are well organized, but like any project in the humanities, it consists of articles (.pdf) and my notes (.rtf) - referring to the articles is sometimes necessary, but more often I check my notes. I think that with OCR support, not only users in the scientific community will end up with more and more large PDFs in their db, topical or not.

I hope excluding from index will be possible in a later version?

Mark

Bill_DeVille · May 29, 2007, 3:31am

It may become possible to exclude a specific document from the See Also operation.

But I don’t think you will want to exclude PDFs from indexing, as there’s a great deal of value in analyzing their content.

I’ve got many hundreds of PDFs in my main database, many of them quite large and some exceeding 500 pages. They remain very useful to me for searching and analysis.

I prefer capturing the content of scientific articles from HTML as rich text documents that include text, images, tables and hyperlinks, rather than as PDF documents. I’ve got over 10,000 articles from journals such as Science, Nature and others captured in this way. Often, the PDF version supplied by journals includes ‘spillover’ material from one or more other articles (which reduces the ‘focus’ of the document). Some journals such as Science often include links to supporting online materials and references, which are not directly accessible from the PDFs they supply.

I don’t like to save HTML or WebArchive versions of articles from most journals, as they tend to include advertisements and other extraneous material.

One of these days inclusion of hyperlinks in PDFs available from journals will become more common – but it’s still rare.

I don’t ‘split’ long documents into smaller chunks.

Tip: Running See Also from a large report or book may not focus on the topic I’m interested in. But I can select a paragraph, a section or chapter, control-click and choose the contextual menu option See Selected Text to limit AI suggestions to the concept I’m exploring.

Maak · May 29, 2007, 7:21am

You’re right - excluding a document from the See Also results is more important than excluding it from indexing, because indexing allows one to find the document if one forgets the title. But it will improve my “see also” results if the list is not full of PDF articles.
I already use “see selected text” a lot and it is one of the most interesting features of DTPO for me. I would benefit from this function even more once it is possible to exclude some articles from the results.

Mark

cgrunenberg · May 29, 2007, 7:49am

A future release will probably add the possibility to exclude items from searching and/or see also.

Maak · May 29, 2007, 1:59pm

This is good news, Christian - thank you. Until then, I’ll continue to scan and group my documents, even if “see also” becomes a bit foggy… It is good to see that the problem is being addressed.

Mark