Hi, I need to convert Twitter bookmarks into a format that’s searchable.
How I did it in the past
Between 2018 and 2020 I used a script (found on the forum) to convert tweet bookmarks into HTML. Records created via this script show the tweet’s text even without an internet connection (but no images or videos), which is what I need.
The scpt file got corrupted at some point. As I didn’t use it for a long time I didn’t notice the corruption, and now I have no backup of the script and can’t find it on the forum anymore (that’s one reason why I now have everything in git…)
But I suspect Twitter also changed something, so it might well be that saving an HTML record that also works offline isn’t possible at all anymore.
Just in case someone who knows HTML wants to take a look, here’s a record created in 2018 via the lost script and one created today via DEVONthink’s script Download > As HTML pages:
Archive.zip (80.6 KB)
Captures taken without internet connection:
I can’t tell whether it’s the different script or something Twitter changed; I’m just adding this in case someone knows what’s making the difference. But I think DEVONthink’s script Download > As HTML pages never produced a record whose text is available offline, as otherwise there would have been no reason for me to use the now lost script.
As I captured several thousand bookmarks (but never converted them), a programmatic solution is needed.
- Has anyone used the Twitter API?
- Can it be used by everyone (without working in academic institutions)?
If the API can’t be used by everyone
- Has anyone found a programmatic solution that doesn’t rely on the URL being opened in a browser?
If there’s no way to capture without the URL being opened in a browser
- How do you capture tweets in 2022?
I’m willing to put a lot of work into finding a solution; however, at the moment I don’t even know where to start.
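One possible starting point, as a rough sketch only: Twitter’s public oEmbed endpoint returns a small HTML blockquote containing the tweet’s text, which stays readable offline once saved. This assumes the endpoint `https://publish.twitter.com/oembed` is still reachable without authentication, which may no longer hold. The function names are made up for illustration, and this is not the lost script’s method.

```python
import html
import json
import urllib.parse
import urllib.request

# Assumption: the public oEmbed endpoint still works without an API key.
OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"


def oembed_url(tweet_url: str) -> str:
    """Build the oEmbed request URL for one tweet."""
    # omit_script asks the service to leave out the widgets.js <script> tag.
    query = urllib.parse.urlencode({"url": tweet_url, "omit_script": "true"})
    return f"{OEMBED_ENDPOINT}?{query}"


def fetch_oembed(tweet_url: str) -> dict:
    """Fetch the oEmbed JSON for one tweet (needs a connection once)."""
    with urllib.request.urlopen(oembed_url(tweet_url), timeout=30) as response:
        return json.load(response)


def offline_page(oembed: dict) -> str:
    """Wrap the returned blockquote in a static page that loads nothing remote."""
    author = html.escape(oembed.get("author_name", "unknown author"))
    source = html.escape(oembed.get("url", ""), quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>Tweet by {author}</title></head>\n"
        f"<body>\n{oembed.get('html', '')}\n"
        f'<p>Source: <a href="{source}">{source}</a></p>\n'
        "</body></html>\n"
    )
```

Calling `offline_page(fetch_oembed(url))` for each bookmarked URL and writing the result to a file would give one searchable HTML record per tweet, with text only and no images or videos.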
There’s nothing good on Twitter anyways, except our
I wish I had more time to answer this better and do a bit more research, but I regret this is the best I can do right now:
- The Twitter API (and the format of tweets) has almost certainly changed since 2018. It seems to change often.
- A change announced last year may or may not be relevant to you.
- Archive Team has a page about archiving Twitter in which they list some tools that may be helpful in this context, as well as some notes, such as how to get the full-sized version of images embedded in a tweet.
- There are other tools not mentioned on the Archive Team page, such as SFM and Thread Reader.
- I’ve personally given up on saving tweets in HTML format. Yes, proper archiving best practices would probably say web archives should be stored in WARC format, but in my experience, the saved content is invariably broken when I view it later (e.g., a year on). Plus, if the author deletes the original tweet, then things are even worse.
- In another forum posting, I described some things I do. Basically, I resort to saving a PDF rendering of the tweet.
My understanding is that Twitter have made it more and more difficult to automate any kind of data collection, whether programmatically or via the GUI. I don’t think academics have any special privileges in that respect.
Your best bet might be going via the archive.org Wayback Machine. In principle, many tweets will have been archived there, so you should be able to look up the archived version. It may be easier to scrape the data that way. There’s an API for making such requests without a browser.
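As a minimal sketch of such a lookup, assuming the Wayback Machine availability endpoint at `https://archive.org/wayback/available` and its documented JSON shape (`archived_snapshots` > `closest`), something like the following could check whether a snapshot exists for a given tweet URL. The function names are hypothetical, and there is no guarantee that any particular tweet has been archived.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: str = "") -> str:
    """Build the availability query for one page URL."""
    params = {"url": page_url}
    if timestamp:
        # Optional target date, formatted as YYYYMMDD or YYYYMMDDhhmmss.
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urllib.parse.urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest snapshot URL from a response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timestamp: str = "") -> Optional[str]:
    """Query the endpoint over the network and return the snapshot URL, if any."""
    request_url = availability_url(page_url, timestamp)
    with urllib.request.urlopen(request_url, timeout=30) as response:
        return closest_snapshot(json.load(response))
```

Looping `lookup()` over a list of bookmarked tweet URLs (with a pause between requests to stay polite) would show which ones have an archived copy worth scraping.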
Might also want to check out this project: GitHub - salcoast/deleted-tweets-archive: These tweets display several bad actors' most divisive uses of the Twitter platform.
Different goals but maybe some useful info.
In terms of capturing tweets in the present, there are some tools in development that allow this en masse (i.e. for quantitative analysis). E.g.: COSMOS – Social Data Science Lab (though I think this is restricted to academic users at the moment, and it’s in request-only beta).
If it’s a matter of grabbing the occasional tweet here and there, I usually take a screenshot and automatically OCR in DEVONthink.
Yeah, that’s what I resort to doing too. A little bit of automation can help capture threads (after logging in to Twitter from within DEVONthink), but alas, only for short threads.
Wow, that’s a lot to read/test. Thanks everyone!
This morning I attempted to save the very many, very informative replies to a particular tweet, which is a case that my usual approach doesn’t handle well. It seems that as you scroll down a long list of replies in the web page, Twitter unloads previous tweets – similar to what Discourse does with long discussions. Just another example of what makes it difficult to capture things on Twitter …
Dynamic content strikes again.
Makes sense for content delivery in cases like this but also makes things difficult for capturing without a bespoke public API available.
Incidentally, great thread…
It might be interesting for DT to monitor how Readwise Reader will do this, as saving Twitter threads is one of their supported features.
If DT doesn’t support it themselves in the future, I might use the Readwise Reader app as an intermediate.
For following Twitter accounts, I have a subscription to Inoreader for social media intelligence (Facebook, Twitter, Telegram, VKontakte). From there I can search and export to other apps/tools.
For capturing tweets, I clip to DEVONthink from Tweetbot. It is very handy.
Thanks to everyone for sharing their finds and experiences on this subject