I have a question about hot to split a Excel file.
Let’s try to explain. I have a lot of Excel files, and each one, has a line with some information about a patent (name, number, abstract, year, etc).
** What I need to do is: each line should be a plan text in DEVONthink.
I show below some pictures to illustrate that. Maybe some of you should help me with a tip about how to create a macro or another way to convert Excel lines in external text files (one text file for each line).
My problem is: I have about 20.000 lines to input at DEVONthink.
You might look into TextSoap, a utility with a lot of tools for processing text files. There is a free trial, I believe. https://www.unmarked.com/textsoap/
It’s called “data munging” (Google the term). I’ve done this sort of thing hundreds (or more) times. While it is certainly possible to do it all in Excel to create the files that you want to put into DEVONthink, that probably would be the hard (and time-consuming) way. It’s certainly not something for DEVONthink to do for you except for the final step of importing which is your goal.
The process, which can be described as an algorithm, would be something like the
following (which is “psuedo-code”):
Step 1: Decide which columns of the Excel file are going to be exported
Step 2: Decide if each line in each Excel file is an individual file (I think that the case, from what you say). If not, then decide. In addition, decide if there is any transformational stuff you can do in the code to the info in the files, e.g. concatenate columns, do some standard re-wording, etc. that you can automate. Develop output file format. Decide output file naming convention. Use *.txt suffix.
Step 3: Develop some code which does something like the psuedo-code below. Test, adjust, correct … recycle till it does what you want.
Step 4: Re-run it all on the thousands of files and yet more thousands of records in Excel to create the text files.
Step 5: Import the new text Files into DEVONthink and do your DEVONthink stuff.
Psuedo-code (‘=’ characters to make it look right … best way I could figure for this) that you would run inside a Mac OSX “terminal”. It would be something like:
read list of Excel file names (all in same folder to make easy)
for each file in list:
====put full contents of the Excel file into a string variable
====for each line (Excel row) in contents variable:
========for each Excel column in the contents:
============copy the relevant columns into individual variables
========build the output as string variable by concatenating/transforming columns See step 2).
========create a unique new file name (see Step 2).
========open a new output file using the new name
========write the output string created above
========close the output file
The tools used depends on who does the work. First step would be to find someone in
your firm or “diaspora” who knows data munging and has some experience and tools. Use the tools that person knows how to use. Or do it all yourself. I like and would use Python supplemented by Pandas. Probably a few hours or so (maybe more, maybe less) to figure out what to do then do it. My gut says 15 minutes. My experience says “be careful of what gut says, please.” Then test, then … let it rip on your data. Put some “print” statements in code to watch progress. Probably good to put some exception testing in to make sure it’s all ok. Inspect the results.
If your colleague is an Excel person and wants to use, I’m sure it would work also. The work process wouldn’t be much different than that expressed in the psuedo-code above.
Recommended Resources (let you and your data munging colleague explore)
Google “read excel into python” and “data munging with python”
Python (already on your Mac, but recommend using Anaconda to install a fresh Python and any other tools you may want to use)
Pandas for Python (use the Excel reading functions, et. al. Arrives with Anaconda)
see examples via Google in Stack Overflow
None of these resources will cost you anything. Just your time to investigate.
I don’t know anything about scripting Excel, but it seems like a two-step process:
(1) Extract each row in an Excel file into individual text files into a folder in macOS by AppleScript . The 2nd post in the link below shows the basic script (I haven’t tested it but that’s the idea) but someone who knows Excel scripting better may be able to automate the range selection. https://discussions.apple.com/thread/7238131
(2) If the script in the above link works, then you can import/index the text files into DT3.
The VBA with a tiny change in file path in the above mentioned link:
Sub toFile()
Dim FilePath As String, CellData As String, LastCol As Long, LastRow As Long
Dim Filenum As Integer
LastCol = ActiveSheet.UsedRange.SpecialCells(xlCellTypeLastCell).Column
LastRow = ActiveSheet.UsedRange.SpecialCells(xlCellTypeLastCell).Row
For i = 1 To LastRow
FilePath = "/Users/user name/Desktop/TestFolder" & "/" & Trim(ActiveSheet.Cells(i, 1).Value) & ".txt"
Filenum = FreeFile
Open FilePath For Output As Filenum
CellData = ""
For j = 2 To LastCol
CellData = Trim(ActiveSheet.Cells(i, j).Value)
Write #Filenum, CellData
Next j
Close #Filenum
Next i
MsgBox ("Done")
End Sub
My another 5 cents: perhaps it means to consider going to Excel/VBA forum for help to automate the process of exporting the rows of a butch of Excel files…
@ngan I must to say many thanks.
In these days, that everything could be hard and difficult, we find nice people that do you well to help us.
@rmschne I will keep in mind your suggestion. I understood that I need time to know well about “data munging” but should be great to learn it in a soon future.
@GordonMeyer I used textsoap for another topic and it works very well.
The VBA (sorry that I used to be able to do way better than this 20 years ago )
Sub toFile()
Dim FilePath As String, CellData As String, LastCol As Long, LastRow As Long
Dim Filenum As Integer
Dim theFileName As String
LastCol = ActiveSheet.UsedRange.SpecialCells(xlCellTypeLastCell).Column
LastRow = ActiveSheet.UsedRange.SpecialCells(xlCellTypeLastCell).Row
For i = 3 To LastRow
theFileName = ThisWorkbook.Name & " - " & ActiveSheet.Cells(1, 1).Value & " - " & Trim(ActiveSheet.Cells(i, 1).Value)
FilePath = "/Users/username/Desktop/TestFolder2" & "/" & theFileName & ".txt"
Filenum = FreeFile
Open FilePath For Output As Filenum
CellData = ""
For j = 2 To LastCol
CellData = ActiveSheet.Cells(2, j).Value & ":"
Print #Filenum, CellData
CellData = Trim(ActiveSheet.Cells(i, j).Value) & vbCrLf
Print #Filenum, CellData
Next j
Close #Filenum
Next i
MsgBox ("Done")
End Sub
I want to rename all files in a database.
The second line is the title that I wish as a new filename.
But:
some titles are so big
some titles have special characters as + or /
Is it possible to create a way to rename it, using an automatic command or script that:
select the second line of a markdown file as its name
limit this filename into 30 characters or 8 words (for e.g.)
original name: Method and apparatus of three-dimensional printing and electronic apparatus
text to use to rename: Method and apparatus of three-dimensional printing
There might be a better/more elegant way to script, and I haven’t tested the scripts extensively.
For words.
on performSmartRule(theRecords)
tell application id "DNtp"
repeat with theRecord in theRecords
set theText to plain text of theRecord
set theName to paragraph 2 of theText
set theWordCount to count (every word of theName)
if theWordCount > 8 then
tell application "TextEdit" to set AppleScript's text item delimiters to " "
set name of theRecord to (words 1 thru 8 of theName) as string
tell application "TextEdit" to set AppleScript's text item delimiters to ""
else
set name of theRecord to theName
end if
end repeat
end tell
end performSmartRule
For characters
on performSmartRule(theRecords)
tell application id "DNtp"
repeat with theRecord in theRecords
set theText to plain text of theRecord
set theName to paragraph 2 of theText
set theCount to count theName
if theCount > 30 then
set name of theRecord to texts 1 thru 30 of theName
else
set name of theRecord to texts 1 thru -1 of theName
end if
end repeat
end tell
end performSmartRule
I don’t know how to deal with special character, not familiar with regex.