2022 May 21

Reference Management: What am I managing?

In a previous post I wrote about how I use org-mode to manage references. Since then, my setup has evolved significantly, with more features. I’m trying to document the features I use and how I go about doing that with emacs. If seeing “emacs” is scaring you, fret not! I’m going to have that part in a different post. My intention with this post is simply to list the problems I had to find solutions for, or am still trying to find solutions for, and my workflow when it comes to reference management. This process makes my life a bit easier, at least when it comes to reference management. Perhaps this can act as a road map for someone trying to figure out what features they can/should have in their reference management software of choice. Before I start, I have to mention that the reference management system I have is also tightly integrated with my knowledge management system (which also includes project management), where I use a combination of mind-maps/concept-maps and zettelkasten note taking. So throughout I’ll be touching on how the reference management system ties into that as well.

Adding entries to my reference database directly from the browser

When I come across a paper I want to add to my database, I want to be able to add it with a single keystroke. Quite often, this also involves accessing the paper through your institution’s access (libproxy can be used for this; I had also written a browser plugin for UBC access, which can be easily modified). When I am adding an entry to the database, I prefer to have the following data:

The bibliography entry in a master .bib file.
The pdf file.
Any tags attached to the entry.

Search through all the pdf files in my database for phrase(s)

This is one of those things that I didn’t know I couldn’t live without. When I read a paper, I come across ideas or phrases I don’t find useful until much later. When I do, all I can remember is that I had read that somewhere. On top of that, when I am exploring a new idea, using this feature to find papers that talk about it also helps me see how the idea relates to other projects and ideas I have worked on. This allows me to quickly find other related work and ideas.
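A minimal sketch of this kind of phrase search in Python. It assumes the text of each pdf has already been extracted to a plain string (for example with a pdf-to-text tool) and is keyed by a citation key; the function names and sample keys are hypothetical, and this is only an illustration, not the emacs-based setup this post is about.

```python
import re


def normalize(text):
    # PDF text extraction tends to break phrases across lines,
    # so collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text)


def search_library(text_by_key, phrases, context=40):
    """Return {key: [snippet, ...]} for entries whose text contains every phrase."""
    results = {}
    for key, raw in text_by_key.items():
        text = normalize(raw)
        snippets = []
        for phrase in phrases:
            match = re.search(re.escape(phrase), text, re.IGNORECASE)
            if match is None:
                break  # one phrase is missing, so skip this entry
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            snippets.append(text[start:end])
        else:
            # Only reached when no phrase was missing.
            results[key] = snippets
    return results


if __name__ == "__main__":
    # Made-up sample data standing in for extracted pdf text.
    library = {
        "doe2021": "We study the mental\nmodel of users during search.",
        "roe2019": "Unrelated text about databases.",
    }
    print(search_library(library, ["mental model"]))
```

Returning a short snippet around each hit, rather than just the key, makes it easier to recognize the half-remembered passage without opening every file.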

Taking notes on pdf files outside the pdf file (like in a plain text file or your favorite note taking software)

This may sound a little iffy at first, but bear with me. Traditionally, you’d take notes in the pdf file itself, often as marginalia. When you come back to the notes, this helps you see them in the context in which they were taken. But the issue with most pdf note taking systems is that the notes are not searchable. If you are only working with a few pdf files, it won’t be too much of an issue. But when you have dozens/hundreds of pdf files with a ridiculous amount of notes, it’s impossible to search through them. Having them in a separate note taking tool allows you to easily search through your notes. When doing that, you also want to make sure there is an easy way to find, link, and situate them in the context in which they were taken. I myself haven’t figured out how this can be reliably done outside emacs, but I am sure someone out there has (the notes field in the bib entry might be a possible solution?). Taking notes like this is also very handy when you are starting a new manuscript. I pick out the papers I want to build the manuscript around and export the notes of only these papers to a preferred format, which gives me an instant draft of ideas I can start working on.

Two way access between master reference database and manuscripts

When working on any manuscript, you’d want to have access to the entries in the master database so that you can easily search for and add references. I write my manuscripts in LaTeX and the master database is a bib file. That allows my auto-completion framework to easily hook into the master database and add a reference without leaving the manuscript.

View notes/pdf files from manuscript

This is the last two points working together, and another reason to have both of those features. While writing a manuscript I find myself pulling up papers (opening the pdf file) and related notes of the literature I use in the manuscript. Being able to do that directly from the manuscript itself is very handy.

Query the database

When the database of papers becomes quite large, it becomes increasingly important to have a better way to locate a paper/note. What I am describing is basically “advanced search”. For example, if I want to find a paper that contains a particular phrase (which is basically searching through the pdf files), has a particular phrase in the notes, has a tag I had associated with it, and also filter by year, authors, or any other piece of information, I’d prefer the search functionality to have the flexibility to do that. Having that allows me to quickly narrow down the list of papers in my very large library. Most “advanced search” functions don’t support something of this nature. A gripe I’ve had with the advanced search options in many libraries/tools for searching through literature is that sometimes all I need is a simple regex search on multiple fields, instead of a convoluted mess. This may sound like a very niche thing, but once I had that working, it became another one of the things I’ve come to rely upon heavily.
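To make “a simple regex search on multiple fields” concrete, here is a minimal sketch in Python. It assumes each entry is a plain dictionary with fields such as year, tags and notes; the entry layout, the `query` function, and the sample data are all hypothetical and only meant to illustrate the idea.

```python
import re


def _as_text(value):
    # Collections such as a set of tags are flattened to one searchable string.
    if isinstance(value, (list, tuple, set, frozenset)):
        return " ".join(sorted(str(v) for v in value))
    return str(value)


def query(entries, **field_patterns):
    """Return the entries where every named field matches its regex.

    Matching is case-insensitive; an entry that lacks a field never matches it.
    """
    compiled = {
        field: re.compile(pattern, re.IGNORECASE)
        for field, pattern in field_patterns.items()
    }
    return [
        entry
        for entry in entries
        if all(
            rx.search(_as_text(entry.get(field, "")))
            for field, rx in compiled.items()
        )
    ]


if __name__ == "__main__":
    # Made-up entries, not taken from a real database.
    entries = [
        {"key": "doe2021", "year": "2021", "tags": {"vis", "hci"},
         "notes": "Argues for serendipity in literature search."},
        {"key": "roe2015", "year": "2015", "tags": {"vis"},
         "notes": "Early take on serendipitous discovery."},
    ]
    for entry in query(entries, year=r"202[0-2]", tags=r"\bvis\b", notes=r"serendipit"):
        print(entry["key"])
```

Because each condition is just a field name and a pattern, narrowing or widening a search means editing one argument and rerunning it.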

The way I have set up my stack, when I need to search for paper(s), I use a querying system. Usually my entry point is one of the quick search functions that are wrappers to predefined queries. The nice thing about it being a query engine and having full access to it is that it allows me to modify the query and rerun the search, allowing me to iterate on my searches much faster. The query engine I use also supports easily extending it to do much more. Searching through pdf files, as I had described above, is one example of my making use of that feature. It also allows me to act on the entries that are the result of the query, like assign all of them a tag. The following could be considered a specific use case of that feature.

Easy export options

One of the shortcomings of my system is that it doesn’t support explicit collaboration. Then again, I find everyone has their own workflow, and most people prefer to have their own notes. On the other hand, it’s very rare that you’d be working on a research project on your own, so you’d be sharing at least the list of papers you are looking at for a given project. Quite often I have to export something from my database, mostly to share it with others. I’ve had to export .bib entries, pdf files and notes, based on some query on my database. I briefly touched on another reason for this when discussing taking notes separately from the pdf file: I create the initial draft of a manuscript from the notes with this export functionality.
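As a rough sketch of the .bib part of such an export, the Python below turns a list of selected entries into BibTeX text. It assumes entries are dictionaries with a `type`, a `key`, and arbitrary other fields; that layout and both function names are hypothetical, and field values are written as-is, without any escaping.

```python
def to_bibtex(entry):
    """Format one entry dictionary as a BibTeX record."""
    fields = {k: v for k, v in entry.items() if k not in ("type", "key")}
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@{entry['type']}{{{entry['key']},\n{body}\n}}"


def export_bib(entries):
    """Join the selected entries into the text of a .bib file."""
    return "\n\n".join(to_bibtex(entry) for entry in entries) + "\n"


if __name__ == "__main__":
    # Made-up entry standing in for the result of a query.
    selected = [
        {"type": "article", "key": "doe2021", "title": "A Title", "year": "2021"},
    ]
    print(export_bib(selected))
```

The same pattern extends to notes or pdf files: run a query, then write out whichever part of each matching entry needs to be shared.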

Connecting papers with ideas

When you are going through copious amounts of research articles, trying to remember everything and figuring out how they all connect to each other is not possible, at least not for me. There are much smarter people who have figured out how to do that more effectively and derived some really cool approaches to solve this knowledge management problem. Two such approaches I’ve come across are concept-maps and zettelkasten note taking. TheBrain and Roam Research are example tools for each approach, respectively. They allow you to quickly create a network of knowledge, like a second brain. If you haven’t come across these, I’ll defer to the large number of tutorials and testimonials available out there. In my case, the tags I mentioned earlier are basically concepts in the concept-map, and the notes I take are part of the zettelkasten note taking system. Having the notes you take and the tags you create also be part of a knowledge management system like this allows you to extract associations between papers much quicker and also refer to the papers from notes outside of the notes on the paper itself. Which is another reason to be taking notes outside of pdf files - so that you can use tools like these to refer to other papers, notes and concepts outside the paper itself.

Crossref and Connected Papers

This is more of a bonus. One of the packages I use allows me to search for papers in the Crossref database. I enter a title and it lists the most relevant papers. I’ve found some very interesting papers this way that I had almost missed by relying only on the online searches I do. Another tool that’s worth mentioning is Connected Papers, which Bradley introduced me to. It also helps find the most relevant papers, with an associated graph representation of the connections.
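For anyone without such a package, a title search can be sketched directly against the public Crossref REST API in Python. This is an illustration under assumptions, not the package mentioned above: the function names are hypothetical, and the code relies only on the `works` endpoint with its `query.bibliographic` and `rows` parameters and on the `message.items` list in the JSON response.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title, rows=5):
    """Build a Crossref works query for a free-text title search."""
    params = {"query.bibliographic": title, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def parse_items(response):
    """Extract (title, DOI) pairs from a decoded Crossref JSON response."""
    items = response.get("message", {}).get("items", [])
    return [
        ((item.get("title") or ["(untitled)"])[0], item.get("DOI", ""))
        for item in items
    ]


def search_crossref(title, rows=5):
    """Run the query over the network and return (title, DOI) pairs."""
    with urllib.request.urlopen(build_query_url(title, rows), timeout=30) as resp:
        return parse_items(json.load(resp))


if __name__ == "__main__":
    for found_title, doi in search_crossref("serendipitous discovery of academic literature"):
        print(doi, found_title)
```

Keeping the URL building and the response parsing separate from the network call makes both easy to check without hitting the API.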

Wishlist

One thing I haven’t fully realized in my system is a way to easily extract the papers that have been cited by another paper in my database. I still haven’t found a reliable way to extract that data, either from the pdf files themselves or from a public database/API.
When searching for new papers, there is no reliable way to search through the abstracts. I had tried downloading all abstracts by scraping the ACM Digital Library, but that resulted in my IP getting banned. Arpit et al. [^1] figured out a better way to do this with the dblp database as part of one of their papers. I’ve been trying to get their scraper to allow me to do this; as of the date of posting this, that’s still a work in progress.

[^1]: A. Narechania, A. Karduni, R. Wesslen and E. Wall, “VITALITY: Promoting Serendipitous Discovery of Academic Literature with Transformers & Visual Analytics,” in IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 1, pp. 486-496, Jan. 2022, doi: 10.1109/TVCG.2021.3114820.