Searching google news = puzzling lack of results

chatoyer · December 11, 2014, 8:26pm

Hi all,

Although I’ve owned DA Pro for quite some time, I’ve never really taken the time to get to know it and make it work for me.

Unfortunately, I’m struggling with even getting a basic search in Google News using the Google News plugin to give me results that are equal to what I would get by going to news.google.com itself.

I’m looking for two terms (separated by AND). I’ve created a new search set, listed news.google.com as the site to search (in ‘search’ mode), ticked the box to use the google news plugin, set my default query as “TERM1 AND TERM2” (without quotation marks), deselected the ‘archived pages’ filter (assumption: give me everything!), and ensured ‘all pages’ is selected in results. Run that and I get… three results.

Run it in google news directly and I get about 10 results just with today’s date alone, and dozens and dozens more going back months.

I’m hoping I’m missing something obvious. Would appreciate any help.

chatoyer

korm · December 11, 2014, 8:40pm

Could you provide specific search terms so that other readers can check results for themselves? I.e., “X AND Y” – what is X; what is Y?

chatoyer · December 11, 2014, 8:45pm

Thanks korm - I ran it again (no changes to set) and am now getting 19 results, all of which are news items dated either today or yesterday. Great, in one sense, but wouldn’t mind diving back in time a bit more.

airline AND ownership

Cheers!

chatoyer · December 11, 2014, 9:00pm

Oddly enough, when I ask it to archive the results (all 18), it only archives 3.

FROBGOBLIN · December 13, 2014, 1:50am

Hi. I’ll take a look at it. I’ve been getting more results than that, but fewer than expected. However, I haven’t had much time in recent days to look into it, and I’ve been moving around, so my Internet connection is too slow. This weekend I ought to have a chance to dig into it a bit more.

FROBGOBLIN · December 13, 2014, 2:01pm

Out of 4115 files searched I got 85 results. The search set I used is here:
evernote.com/l/AAHRB1FzWr1H … jVJ43HIUx4

FROBGOBLIN · December 13, 2014, 2:46pm

By the way, it is nice to use a bit of automation to search a single site, but I am not sure you’ll save a whole lot of time or effort that way for a one-off search. Doing it on your own on Google News with the DEVONthink clipper might not be a significantly different experience.

But, I think three of the strengths of DA for your particular search would be its ability to perform searches of multiple sites (not just Google News, but Yahoo News and any other news aggregation / news sites), its ability to perform this on a schedule (maybe the early morning, when you aren’t using the computer), and its ability to save all of the results into DEVONthink automatically.

Over the course of a few days, I can sometimes accumulate hundreds or thousands of PDFs (I prefer to save pages as PDFs) on a research topic without doing anything beyond the few seconds it takes to create the initial search set. I’ve found this to be a great time saver and a way to be exposed to sources of information I’d otherwise miss relying only on surfing alone.

Bill_DeVille · December 13, 2014, 3:50pm

FROBGOBLIN makes a good point. For topics in which I have a continuing interest, including a research project during its progress, I create custom scheduled search sets in DEVONagent Pro. In these search sets I check the option to send me an email listing the results of each successive run. That provides a convenient means of scanning a list of new captures to the Archive to look for gems.

In many cases I’ll also check the option to send the results of each search run to DEVONthink. Unlike FROBGOBLIN, I avoid capture as PDF, as I’m working on a MacBookPro retina with a 500 GB SSD, and want to avoid excessive use of disk space. I do the captures to DEVONthink as HTML, instead, as file size is often greatly reduced.

True, HTML capture has disadvantages; a major one is that images are not available for offline viewing and may become unavailable if the page disappears from the Web in the future. These captures also include the full Web page, and my practice is to avoid full-page captures that introduce irrelevant text, which would reduce the efficiency of searches and the AI assistants in DEVONthink.

I try to take a little while each day to prune new HTML additions to my DEVONthink databases. The first step is to decide whether a captured item sent by the DEVONagent Pro search set is worth keeping. If not, I delete it. The second step is to use the keyboard shortcut Command-) for the Service to capture as rich text, if there’s irrelevant text and images (or images I want to keep). I’ll then quickly edit out unwanted content in the rich text document and delete the HTML version.

Comments: DEVONagent Pro has a number of settings that may be useful and should be examined. For one thing, there’s a Preferences setting that imposes a maximum file size on downloaded documents. I usually set it higher than the default setting, especially in cases where a review of the Log shows that documents were skipped because of a larger file size than currently allowed. Other settings in a search set may include or exclude documents by filetype, &c. For example, although most pages are HTML, I probably don’t want to exclude PDF or Word documents relevant to my interests.

chatoyer · December 22, 2014, 12:53pm

Thanks for the responses, everyone. Appreciated. It seems to have righted itself over the last week or so. I now have several automated search and archive tasks taking place on a nightly basis that are capturing pretty much what Google finds. I think the behaviour I was seeing at the top of this thread may very well have been an anomaly.

Good suggestion from FROBGOBLIN re: multiple sites, not just google. I will do this to expand the capture. For now, I’m having DA simply archive and, like Bill, I review every day or so to see if anything has popped up that is worthy of being hauled into DTPO.

Bill, interesting that you do HTML capture for DTPO storage. I generally do Web Archives or pdf as my goal is to keep things for the long haul. I now have an aviation news database (among other databases) with 6000 news items going back to 2007 when I started with DT. With any luck, SSD sizes will keep pace with my appetite.

chatoyer

Bill_DeVille · December 22, 2014, 4:46pm

If one is making full-page captures, HTML files are the smallest filetype that will include permanent searchable text and links in the capture. That’s an advantage. But I retain few HTML files in my databases, because their images are dependent on continued online access to them and (more importantly) I don’t like full-page captures because they often include irrelevant text and images. Irrelevant text reduces the efficiency of searches and the AI assistants.

I don’t care about the niceties of design of the Web page, but I do care about its information that I consider useful.

WebArchive captures sent from DEVONagent Pro to DEVONthink have the advantage of retaining images so that they are available offline. But they are full-page captures and may result in large files.

When performing scheduled searches the search set is configured to send HTML results of each run to a DEVONthink database. Later, I’ll inspect the results and weed out those I don’t want to keep. Those that I keep will be converted to rich text and the HTML version will be deleted afterwards. Then I’ll delete the unwanted images and text from the rich text document.

Today, for example, one such document would have required 12.6 MB file storage space if saved as full-page WebArchive. But the rich text document that contained the desired text and tables of the Web page required only 6.8 KB storage space. Not even one of the images on that page was relevant, and there was more irrelevant than relevant text on the original page.

With a quick edit I eliminated the potential of a lot of false positive results in searches and AI suggestions and saved orders of magnitude in storage size of the document. That’s not at all uncommon.
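To put a number on "orders of magnitude", here is a minimal Python sketch using the two sizes quoted above (12.6 MB and 6.8 KB). It assumes decimal units (1 MB = 1,000,000 bytes), and the function name `size_reduction` is made up for illustration; it is not part of DEVONthink or DEVONagent.

```python
import math


def size_reduction(original_bytes, reduced_bytes):
    """Return (ratio, orders_of_magnitude) between two file sizes."""
    ratio = original_bytes / reduced_bytes
    return ratio, math.log10(ratio)


# Sizes quoted in the post, converted with decimal units (an assumption).
webarchive_bytes = 12.6 * 1000**2  # 12.6 MB full-page WebArchive
rich_text_bytes = 6.8 * 1000       # 6.8 KB trimmed rich text

ratio, orders = size_reduction(webarchive_bytes, rich_text_bytes)
print(f"{ratio:.0f}x smaller, about {orders:.1f} orders of magnitude")
```

With these figures the trimmed document comes out roughly 1,850 times smaller, a little over three orders of magnitude.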

I work on a MacBook Pro Retina with a 500 GB SSD. If I had retained full-page WebArchive or PDF versions of all the documents I’ve captured from the Web, I would have run out of storage space on that drive long ago. But by reducing captured documents from the Web to only the actually desired content, I still have a reasonable amount of free space on the SSD, so it continues to run quickly and retain room for more captures.

FROBGOBLIN · December 23, 2014, 1:41am

I prefer one-page PDFs automatically put into my account so that the results appear in searches and the content is retained intact for the long term. I weed out the crud when I have time, and I move stuff onto an external drive when necessary. However, even with only 256 GB, I still have plenty of space for more. It requires a little bit of data management. Basically, I have everything I don’t need for daily use on an external drive. This frees up a lot of space for the more important stuff. If you go paperless, including your library of books, then you’ll easily end up with terabytes of data, and that won’t fit on any local storage anyhow.

The nice thing about DEVON stuff is the flexibility. Bill does HTML, I do PDFs, Bill has a bunch on his local drive, I have a bunch on my external drive, etc. You can tailor it to fit your workflow without being constrained by cloud limits or proprietary formats.

And, it is entirely secure. There are a lot of people throwing their hands up in the air over security and privacy these days. I don’t think they have found DEVONthink yet.

chatoyer · December 26, 2014, 12:56pm

This is fantastic. For some reason I had my size column hidden and, once revealed, I too had a number of web archives well over 10MB. I think I will start converting a few of these. They are important enough to keep, but not important enough to occupy that much space.

Thanks Bill.
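
For anyone who wants to spot oversized web archives outside the size column, a rough Python sketch follows. It assumes the captures sit in an ordinary folder (for example, exported or indexed files) as `.webarchive` files; the function name `find_large_files` and the 10 MB threshold are illustrative choices, and nothing here uses a DEVONthink API.

```python
from pathlib import Path


def find_large_files(folder, suffix=".webarchive", min_bytes=10 * 1000**2):
    """List (size_in_bytes, path) for files at or above min_bytes, largest first."""
    hits = []
    for path in Path(folder).rglob(f"*{suffix}"):
        if path.is_file():
            size = path.stat().st_size
            if size >= min_bytes:
                hits.append((size, path))
    # Sort on size only, so the biggest candidates for trimming come first.
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


if __name__ == "__main__":
    # "." is a placeholder; point this at the folder holding your captures.
    for size, path in find_large_files("."):
        print(f"{size / 1000**2:8.1f} MB  {path}")
```

The script only reads file sizes and changes nothing, so converting or deleting the items it lists remains a manual step in DEVONthink.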