So you have a small batch of pdf files that you would like to add some meta-data to, cull through the corpus for specifics and easily link to the original files.
But all you’ve got is a File System and a Spreadsheet Program .
That is easy enough to do.
(The example that I am going to create is one I used for handling Juror Questionnaires in a recent civil case I was involved in. So the column headings will reference generic type juror questions. I will also have two sets of data, one that is partial (Annotated) and one that is a final complete set (Clean) There is no identifying data or actual case information.)
The File System.
- Get all your pdf files in to a set directory (what we will later refer to as the “root directory”) on your c:\ drive.
- I’m going to use c:\temp\ as my directory.
- I have two sub-directories in here: Annotated and Clean. Both locations have files.
- Get a text file of the files and the file path.
- Use Command Prompt.
- Get to the main directory (c:\temp\Original)
- Use “dir *.pdf /ON /B /S > FileList.txt”
- We do not want the drive letter or extraneous path, so strip that info out.
The spreadsheet.
- Create your spreadsheet.
- Name sheet one “Document Management”.
- Label your rows with the meta-data types about the corpus such as; Date of Document, Author, Checked, etc.
- This is actually lacking in my data, as all the unique data is within the documents, vice being about the documents.
- And label rows as needed about things the documents contain such as; Loss Amount, Gender, Age, Address, etc.
- This is where all the information was contained in this project.
- Optionally add a “key” column (field) that may match to other datasets.
- Finally, create a column that will hold the links to the pdf files themselves.
- Name sheet two “Documents”.
- Enter “DocumentPathAndName” into cell A1
- I have two sets of pdf files that apply to one or more jurors.
- I created two columns. One for path and file for the Clean versions (A1) and one for the path and file for the Annotated versions (B1)
- Enter “DriveLetter” into cell C1.
- Enter the drive letter containing the files into cell D2.
- In my project “c:”
- Enter “PathToRoot” into cell E1
- Enter the path from the drive letter to the root folder into cell E2.
- In my project “\temp”
- Import FileList.txt or insert the information from FileList,txt into column A, starting in row 2 down.
- Either way can be used. We want each line of the file to be in one cell in separate rows.
- I did this twice. Once for the Annotated files and once for the Clean files.
- This is what my sheet looked like:
- Now let’s put a formula in the Document Management sheet to link these records to the appropriate pdf files.
- In cell E2 enter “=HYPERLINK(CONCATENATE(Documents!$D$1,Documents!$F$1,Documents!B2))”
- You can change what is displayed in the cell by adding a comma between the last two closing parenthesis and populating it like this: “,CONCATENATE("Juror Number ",A2," Clean")”
- This will now display “Juror Number 1 Clean” and increment as appropriate.
- In my project I redid this for the Annotated versions, with the appropriate changes.
- Microsoft gives a warning whenever hyperlinks are clicked in Office programs. This Support KnowledgeBase article discusses how to change that action: support.microsoft.com/kb/925757.
Now that we have our spreadsheet setup and populate the cells with data, we can use (in Excel) the Data/Filter option.
With this setup, we can apply filtering and sorting to any or all of the columns to find specific sets of results. From there it is a simply click on the link to view the actual pdf files.
I know this article is kind of long and the graphics may be a bit small; but hopefully you can see how to implement a quick and dirty document management process can be done using on the File System and a Spreadsheet.
Feel free to pass along any comments or check with me if you have any specific projects this methodology could be used for.
chuck