Split file into multiple files

Hi,
whenever I work away from home, be it on the office PC, in an internet café or on a PDA, I usually create large files. Especially with my Alphasmart-3000, a writing tool that saves text in 8 different files, I end up with quite long text files. When I am back at home, I have to spend some time splitting each file up into several small files in DT.
I have no experience in scripting. Is it possible to set up a script in DTPro which cuts one large text file into separate pieces? And can I specify the separation points while I write? This would be extremely handy for the users of bibliographic software as well, because it would make it possible to split a long literature list into separate files.

For example, I could enter a key or a key combination wherever I want the file to be split. Wherever I insert, say, the ^ character, the script would know: “Ok, now I have to split the document here”, create a new RTF file, and go on until the whole large file is split into several RTFs.
I tried Automator, but failed. Do you have an idea?

thank you!
Mark

It’s definitely a case for “csplit,” a standard Unix command-line utility (it is not part of bash itself, but it ships with Mac OS X).

ss64.com/bash/csplit.html

A little AppleScript would be very nice (and I would like to see it as an addition to DEVONthink’s offered scripts): you input the regular expression to be used as a split point, the script uses the currently selected record or records, and then DEVONthink imports all of the resulting files and renames each one to the first line of the created document.

It doesn’t seem to me that it’d be all that hard, but I don’t personally have the mojo for it. It’d be especially useful for people like me, who often download long lists of quotes and anecdotes and so forth.

Here’s a non-scripting way to break long files into small DT entries.

Place the cursor at the start of the text section.
Hold down the shift key.
Scroll down to the end of the text section and click there.
The section will be selected (it turns your chosen highlight color).
Drag the selected text section to DT’s far left pane.
Rename the new DT entry as you please.

That takes only seconds, and you have control over the entire process.

Here is a script I just wrote, since I have had similar cases where this would come in handy. It’s no doubt fairly ugly and inefficient code, since I’m a n00b.

It asks for the delimiter you want to use (or if you want to split into paragraphs), and then splits the text of the current document using the text item delimiters and imports them into DEVONthink using the text as the title.

I hope that works, and you’re free to modify/distribute/wtfever.

tell application "DEVONthink Pro"
	set theSelection to the selection
	if theSelection is {} then error "Please select some contents."
	-- Ask for the delimiter; an empty answer means "split at every line feed"
	display dialog "Enter the desired text delimiter (or nothing to break at each paragraph):" default answer "" buttons {"OK"} default button 1
	-- Despite its name, SplitPointRegEx is matched literally, not as a regular expression
	set SplitPointRegEx to text returned of the result
	if SplitPointRegEx is equal to "" then set SplitPointRegEx to ASCII character 10
	-- Remember the current delimiters so they can be restored afterwards
	set OldDelimiters to AppleScript's text item delimiters
	repeat with CurrentItem in theSelection
		set AppleScript's text item delimiters to SplitPointRegEx
		set theSource to the plain text of CurrentItem
		set RepeatCount to 0 as integer
		set TotalCount to (count each text item of theSource) as integer
		-- Create one plain-text record per non-empty piece, titled with the piece itself
		repeat until RepeatCount is equal to TotalCount
			set RepeatCount to RepeatCount + 1
			set CurrentText to (text item RepeatCount of theSource)
			if length of CurrentText is greater than 0 then
				create record with {name:CurrentText, type:txt, plain text:CurrentText}
			end if
		end repeat
	end repeat
	
	set AppleScript's text item delimiters to OldDelimiters
end tell

I also have a previous version of this script that uses the command-line utility csplit, makes temporary files and then cleans up after itself. That might be more efficient than the pure AppleScript solution on huge tasks (i.e., breaking up a whole damn book), and I’ll provide it if anyone wants it. I haven’t benchmarked them, though, and I doubt the difference is noticeable.

Thank you for the immensely useful answers, especially the awesome script! It just works great, and is exactly what I have been looking for. The method of dragging text clippings manually is also something I can use (together with the groups palette) if I have a text from someone else. In my own texts, I can now use delimiters to speed up the process! Great!
When I tried it today, I found that I could further speed up the processing by adding numbers. For example, if I have a long eBook and I want the new files to appear in a certain order, I can add the delimiter plus a number for each topic. Say I have an article with passages containing quotes from Paul de Man (topic 1) and other quotes concerning Halloween (topic 2). At the end of each passage, I can add another delimiter. In this example, what I add looks like this:

^1 [… first text passage on topic 1. . … …] ^
^1b […2nd text passage on topic 1. . … …] ^
^1c […3rd text passage on topic 1. . … …] ^

and

^2a [… first text passage on topic 2. . … …] ^
^2b[…2nd text passage on topic 2. . … …] ^
^2c […3rd text passage on topic 2. . … …] ^

Now the script comes in. I enter the delimiter “^”, and what I get is a number of text clippings that are in a non-arbitrary order. I can now group them. This might look difficult, but for someone who prefers to work with shortcuts, it is extremely handy.

By the way, do you remember Steve Johnson’s review of DTPro?

http://www.stevenberlinjohnson.com/movabletype/archives/000230.html

In my eyes, this issue can be handled much more easily now. I wonder how the script will evolve…

Thank you!

Mark

Glad you liked it :slight_smile:

I decided to test it out on Francois Duc De La Rochefoucauld’s Reflections, which is nice because the vast majority of the paragraphs are complete thoughts. It whizzed right through it in about 30 seconds, I’d guess, on an iBook G4 (1.2GHz, 1.25GB).

What Steve Johnson is talking about is possible with csplit, which can break a text every N lines. If there are no line breaks except at paragraphs, which is generally the case except with dialogue, then that should work fine. Of course, with csplit you can also set up a pretty complex set of conditions to be met, or even pre-process the text, for example by inserting section numbers automatically so that you can keep the snippets arranged in order.

Or you can alter the above AppleScript to say:


create record with {name:(RepeatCount as string) & ".  " & CurrentText, type:txt, plain text:CurrentText}

Maybe (I have to go to class in a couple minutes and can’t check) something like (first 50 characters of CurrentText) might make those titles a little less unwieldy…
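The title idea can be sketched as a small handler. This is only a sketch: `makeTitle` is a hypothetical helper name (not part of DEVONthink), and the 50-character limit is just the figure suggested above. It coerces the counter to text explicitly, because in AppleScript an integer joined to text with `&` produces a list rather than a string.

```applescript
-- makeTitle is a hypothetical helper, not part of DEVONthink's dictionary.
-- It builds a title such as "3.  First characters of the snippet".
on makeTitle(theNumber, theText, maxLength)
	-- Use only the first line, so a multi-paragraph snippet gets a one-line title
	set theTitle to theText
	if (count of paragraphs of theText) > 0 then set theTitle to paragraph 1 of theText
	-- Truncate long titles; "text 1 thru N" would fail on text shorter than N
	if (length of theTitle) > maxLength then set theTitle to text 1 thru maxLength of theTitle
	-- Coerce the number explicitly: integer & text would yield a list, not text
	return (theNumber as text) & ".  " & theTitle
end makeTitle

makeTitle(3, "A short snippet", 50)
```

Inside the `tell application "DEVONthink Pro"` block the call needs `my`, for example `name:my makeTitle(RepeatCount, CurrentText, 50)`.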

And you could add a little if…end if block to check the snippet for length, and if it’s less than 500 characters, append to it the next snippet, and so on. That would be easy.
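A rough sketch of that length check, working on a plain list of snippets before any records are created. `mergeShortSnippets` is a hypothetical handler name, and the joiner (for example a line feed) is an assumption; the 500-character threshold would be passed in as `minLength`.

```applescript
-- mergeShortSnippets is a hypothetical helper name.
-- It glues each snippet onto the following ones until the
-- accumulated text is at least minLength characters long.
on mergeShortSnippets(theSnippets, minLength, theJoiner)
	set mergedList to {}
	set pending to ""
	repeat with aSnippet in theSnippets
		set thisText to contents of aSnippet
		if pending is "" then
			set pending to thisText
		else
			set pending to pending & theJoiner & thisText
		end if
		-- Long enough: emit it and start collecting again
		if (length of pending) >= minLength then
			set end of mergedList to pending
			set pending to ""
		end if
	end repeat
	-- Whatever is left at the end becomes the last (possibly short) snippet
	if pending is not "" then set end of mergedList to pending
	return mergedList
end mergeShortSnippets

mergeShortSnippets({"ab", "cd", "efghij", "k"}, 4, " ")
```

With a threshold of 4 and a space as joiner, the first two snippets are merged, the third is long enough on its own, and the trailing short one is kept as is.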

I recently found an electronic copy of The Oxford Dictionary of Quotations, which is always fantastic for writing essays. The quoted individuals aren’t separated by any specific symbol or number of line breaks, but fortunately there is a precisely accurate table of contents (i.e., including diacritical marks). I spent some time writing a script to separate it into new DT documents, and this is what I came up with.

tell application "DEVONthink Pro"
	-- Save the current delimiters before changing them, so they can be restored at the end
	set OldDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ""
	set theSelection to the selection
	if theSelection is {} then error "Please select some contents."
	set ItemCounter to 0 as integer
	set theSourceText to "2283472018920498327409012029383483748291948273498329849328" as string
	repeat with CurrentItem in theSelection
		set theSource to the plain text of CurrentItem
		set AppleScript's text item delimiters to ASCII character 10
		set BigCount to 1 as integer
		set theDelimiters to the text items of "2.0 B 
3.0 C 
4.0 D 
5.0 E 
6.0 F 
7.0 G 
8.0 H 
"
		set TopCount to (count each text item of theDelimiters) as integer
		repeat until BigCount is equal to TopCount
			set SplitPointRegEx to text item BigCount of theDelimiters
			set AppleScript's text item delimiters to SplitPointRegEx
			if theSourceText is equal to "2283472018920498327409012029383483748291948273498329849328" then set theSourceText to the text items of theSource
			set LittleCount to 1 as integer
			set TextCount to (count each text item of theSource) as integer
			repeat until LittleCount is equal to TextCount
				set ThisItemText to the last text item of theSourceText
				set AppleScript's text item delimiters to text item (BigCount + 1) of theDelimiters
				set ThisItemText to the first text item of ThisItemText
				set AppleScript's text item delimiters to ThisItemText
				set NowCount to (count each text item of theSourceText) as integer
				if NowCount is equal to 3 then set theSourceText to items 2 thru -1 of theSourceText
				if NowCount is equal to 2 then set theSourceText to the second text item of theSourceText
				if NowCount is equal to 1 then set theSourceText to theSourceText
				set AppleScript's text item delimiters to ""
				create record with {name:SplitPointRegEx, type:txt, plain text:ThisItemText}
				set AppleScript's text item delimiters to SplitPointRegEx
				set LittleCount to LittleCount + 1
			end repeat
			set BigCount to BigCount + 1
			display dialog "Continue?"
		end repeat
	end repeat
	set AppleScript's text item delimiters to OldDelimiters
end tell

Notes:

  1. The long number I set theSourceText to is just a random number, a way I can check whether it has been set to an actual source text or not without any possible worry about whether an actual source text might have the same contents as my marker. I don’t think it’s necessary, but it seemed like a good idea at the time.

  2. This is probably extremely inefficient, but I couldn’t get it to work in any other way.

  3. It’s SLOW… but I blame it on the size of the Dictionary (over 580 000 words). The delimiters I have up there now are to split it into smaller files.

  4. I tried to make it so that the user’s typed/pasted input into a dialog would become the list of delimiters. However, it didn’t seem to work.

Anyway, this works quite well for me. Hope someone else can get some use out of it. It should work with any document for which you have a table of contents of some sort…
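For comparison, the same table-of-contents idea can be sketched more compactly as a pure text handler. This is an untested sketch, not a replacement for the script above: `splitAtHeadings`, `sectionTitle` and `sectionText` are hypothetical names, and it assumes the headings are listed in document order and that the first occurrence of each heading in the text is the real section start.

```applescript
-- splitAtHeadings is a hypothetical helper name.
-- It returns a list of records {sectionTitle, sectionText}, one per heading found.
on splitAtHeadings(theText, theHeadings)
	set oldDelims to AppleScript's text item delimiters
	set theSections to {}
	set remainder to theText
	repeat with i from 1 to (count of theHeadings)
		set thisHeading to item i of theHeadings
		-- Drop everything up to and including the first occurrence of this heading
		set AppleScript's text item delimiters to thisHeading
		set parts to text items of remainder
		if (count of parts) > 1 then
			-- Rejoining with the heading as delimiter keeps any later occurrences intact
			set remainder to (items 2 thru -1 of parts) as text
			set theBody to remainder
			-- The section ends where the next heading begins (if there is one)
			if i < (count of theHeadings) then
				set AppleScript's text item delimiters to item (i + 1) of theHeadings
				set bodyParts to text items of remainder
				if (count of bodyParts) > 0 then set theBody to item 1 of bodyParts
			end if
			set end of theSections to {sectionTitle:thisHeading, sectionText:theBody}
		end if
	end repeat
	set AppleScript's text item delimiters to oldDelims
	return theSections
end splitAtHeadings

splitAtHeadings("intro A one B two C three", {"A ", "B ", "C "})
```

Each returned record could then be handed to `create record with` inside the DEVONthink `tell` block, using `sectionTitle` as the name and `sectionText` as the plain text.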

I have just stumbled across this after trying to do something similar for days! Thank you so much!!

Chris

Wow, talk about Karma.

I was thinking of requesting something to break up longer texts into smaller files just a few days ago.

I haven’t had a chance to play with the scripts but I surely will.

thanks to all.

:slight_smile:
cheers

Dear folks
as you may have noticed, the script works great for simple text notes, but so far not with RTF. I have no experience with AppleScript, and I have spent the last few hours trying to alter the “split” script, but it won’t work. Sadly, I am not even able to get kalisphoenix’ alteration to the script to work. :blush:

What I am trying to do is to alter kalis’ script in a way that it

  • creates RTF files
  • in the current group
  • names the new files like the original file, but with a running number added (e.g. “filename” will be split into something like filename-01, filename-02, filename-03)

If scripting is not like higher math to you - could you have a look at it? :unamused:

Thank you,

Mark

I tried it, but it gives me plain text files with just CurrentText as the title. I also tried to add a display dialog so that I could set part of the name myself (let’s say “1.mytitle” etc.), but didn’t succeed.
Could somebody help me?

I searched for three hours to find a program … a plug-in … a script … anything! … that would do exactly this.

THANK YOU THANK YOU THANK YOU!!! :mrgreen:

This works fine for me except for the option with adding numbers. Also, just hitting return does not split the document at paragraph marks. I don’t know enough about AppleScript to make it work. Anyone out there who could? This would be really useful for me if I could get it to number the split documents so they stay in order. Thanks.
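Two small handlers that might help with both points, offered as a sketch only. `nonEmptyParagraphs` and `numberedName` are hypothetical names, not part of DEVONthink. The first relies on AppleScript's own `paragraphs`, which treats a carriage return, a line feed, or the pair of both as a paragraph break; the second builds zero-padded names so that the pieces sort in their original order (up to 99 pieces with two digits).

```applescript
-- Both handlers are hypothetical helpers, not part of DEVONthink.

-- Split text into its non-empty paragraphs, whatever the line endings are
on nonEmptyParagraphs(theText)
	set theResult to {}
	repeat with aParagraph in paragraphs of theText
		set thisText to contents of aParagraph
		if thisText is not "" then set end of theResult to thisText
	end repeat
	return theResult
end nonEmptyParagraphs

-- Build names such as "filename-01", "filename-02", ... "filename-10"
on numberedName(baseName, theNumber)
	if theNumber < 10 then return baseName & "-0" & (theNumber as text)
	return baseName & "-" & (theNumber as text)
end numberedName

nonEmptyParagraphs("first" & return & linefeed & "second" & linefeed & linefeed & "third")
```

In the original script, the title could then be built with something like `name:my numberedName(name of CurrentItem, RepeatCount)`; handlers called from inside a `tell` block need the `my` prefix.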

The original script is great, thanks!

But it won’t work with a selection in other programs, at least not in those I wanted it to work with (TaskPaper)!

The alternative is of course importing the file as plain text into DEVONthink, and then running the script.

I wonder if this can be tweaked to work in all apps with the current selection, or, more feasibly, with the current clipboard content. Unfortunately, I still don’t know how to write AppleScript :frowning:.

Merry Christmas to all.
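A clipboard variant might look roughly like the sketch below. It is untested and makes several assumptions: the clipboard holds plain text, the delimiter is "^", and the `create record with` call is simply the one from the original script. `splitText` is a hypothetical helper name.

```applescript
-- splitText is a hypothetical helper name.
-- It splits text at a literal delimiter and restores the old delimiters afterwards.
on splitText(theText, theDelimiter)
	set oldDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set thePieces to text items of theText
	set AppleScript's text item delimiters to oldDelims
	return thePieces
end splitText

-- Read plain text from the clipboard (this fails if the clipboard holds no text)
set clipboardText to (the clipboard as text)
-- Assumption: "^" is the delimiter typed into the text
set thePieces to splitText(clipboardText, "^")

tell application "DEVONthink Pro"
	repeat with aPiece in thePieces
		set thisText to contents of aPiece
		-- Same record creation as in the original script, skipping empty pieces
		if length of thisText is greater than 0 then
			create record with {name:thisText, type:txt, plain text:thisText}
		end if
	end repeat
end tell
```

Copying the text in any application and then running the script would avoid importing the whole file first.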

This is almost exactly what I’ve been looking for.

My use case: copy my Kindle highlights/notes from the kindle.amazon.com page and then split so I have one DTP note per highlight/note. But it needs to work with RTF because then you retain the hyperlinks “Read more at location 3693” which is fantastically useful, because a single click then opens the book in Kindle Reader at the precise location.

So if I could just repeat the request the OP made back in 2007 – if you know enough AppleScript to modify this to work with RTF it would be really great to have this!

RTF is an entirely different animal than plain text. Doing this under the hood is no trivial task (and I know more than enough AppleScript to say so).

Ah well, that probably explains why it’s not been done to date.

To be honest it’s hardly a big deal to switch to the kindle reader app and type in the location manually, just not quite as cool. Certainly not much of a ROI if it would take a lot of time.

I split any book in 5 seconds. Just use Adobe Acrobat 11: View, Tools, Pages, Split Document. You have several options: split every 1, 2, 3, 4, 5 (etc.) pages, split by bookmarks, and so on. It adds the page number before (or after) the title. When you index or import into DEVONthink, all you have to do is sort the files so that the split pages follow the order of the original.