Thursday, November 13, 2014

Using a Spreadsheet for Document Management

So you have a small batch of pdf files that you would like to add some meta-data to, cull through the corpus for specifics and easily link to the original files.

But all you’ve got is a File System and a Spreadsheet Program .

That is easy enough to do.

(The example that I am going to create is one I used for handling Juror Questionnaires in a recent civil case I was involved in.  So the column headings will reference generic type juror questions.  I will also have two sets of data, one that is partial (Annotated) and one that is a final complete set (Clean)  There is no identifying data or actual case information.)

The File System.

  1. Get all your pdf files in to a set directory (what we will later refer to as the “root directory”) on your c:\ drive.
    1. I’m going to use c:\temp\ as my directory.
      1. I have two sub-directories in here:  Annotated and Clean.  Both locations have files.
    2. Get a text file of the files and the file path.
      1. Use Command Prompt.
      2. Get to the main directory (c:\temp\Original)
      3. Use “dir *.pdf /ON /B /S > FileList.txt”
        1. My file looks like this (using Notepad++)image
    3. We do not want the drive letter or extraneous path, so strip that info out.
      1. Use a columnar delete or Find/Replace process.
      2. Leave a backslash as the first character of each line.
      3. Now my file no longer has that info:image


The spreadsheet.

  1. Create your spreadsheet.
  2. Name sheet one “Document Management”.
    1. Label your rows with the meta-data types about the corpus such as; Date of Document, Author, Checked, etc.
      1. This is actually lacking in my data, as all the unique data is within the documents, vice being about the documents.
    2. And label rows as needed about things the documents contain such as; Loss Amount, Gender, Age, Address, etc.
      1. This is where all the information was contained in this project.
    3. Optionally add a “key” column (field) that may match to other datasets.
    4. Finally, create a column that will hold the links to the pdf files themselves.
      1. In my project, I had two columns:  Clean (column E) and Annotated (column F):image
  3. Name sheet two “Documents”.
    1. Enter “DocumentPathAndName” into cell A1
      1. I have two sets of pdf files that apply to one or more jurors.
        1. I created two columns.  One for path and file for the Clean versions (A1) and one for the path and file for the Annotated versions (B1)
    2. Enter “DriveLetter” into cell C1.
      1. Enter the drive letter containing the files into cell D2.
        1. In my project “c:”
    3. Enter “PathToRoot” into cell E1
      1. Enter the path from the drive letter to the root folder into cell E2.
        1. In my project “\temp”
    4. Import FileList.txt or insert the information from FileList,txt into column A, starting in row 2 down.
      1. Either way can be used.  We want each line of the file to be in one cell in separate rows.
      2. I did this twice.  Once for the Annotated files and once for the Clean files.
    5. This is what my sheet looked like:image
  4. Now let’s put a formula in the Document Management sheet to link these records to the appropriate pdf files.
    1. In cell E2 enter “=HYPERLINK(CONCATENATE(Documents!$D$1,Documents!$F$1,Documents!B2))”
      1. You can change what is displayed in the cell by adding a comma between the last two closing parenthesis and populating it like this:  “,CONCATENATE("Juror Number ",A2," Clean")”
      2. This will now display “Juror Number 1 Clean” and increment as appropriate.
    2. In my project I redid this for the Annotated versions, with the appropriate changes.
    3. Microsoft gives a warning whenever hyperlinks are clicked in Office programs.  This Support KnowledgeBase article discusses how to change that action:  support.microsoft.com/kb/925757.


Now that we have our spreadsheet setup and populate the cells with data, we can use (in Excel) the Data/Filter option.

With this setup, we can apply filtering and sorting to any or all of the columns to find specific sets of results.  From there it is a simply click on the link to view the actual pdf files.

image


I know this article is kind of long and the graphics may be a bit small; but hopefully you can see how to implement a quick and dirty document management process can be done using on the File System and a Spreadsheet.


Feel free to pass along any comments or check with me if you have any specific projects this methodology could be used for.


chuck