Old Geeks Rule

Thursday, November 13, 2014

Using a Spreadsheet for Document Management

So you have a small batch of pdf files that you would like to add some meta-data to, cull through the corpus for specifics and easily link to the original files.

But all you’ve got is a File System and a Spreadsheet Program .

That is easy enough to do.

(The example that I am going to create is one I used for handling Juror Questionnaires in a recent civil case I was involved in. So the column headings will reference generic type juror questions. I will also have two sets of data, one that is partial (Annotated) and one that is a final complete set (Clean) There is no identifying data or actual case information.)

The File System.

Get all your pdf files in to a set directory (what we will later refer to as the “root directory”) on your c:\ drive.

I’m going to use c:\temp\ as my directory.

I have two sub-directories in here: Annotated and Clean. Both locations have files.

Get a text file of the files and the file path.

Use Command Prompt.
Get to the main directory (c:\temp\Original)
Use “dir *.pdf /ON /B /S > FileList.txt”

My file looks like this (using Notepad++)

We do not want the drive letter or extraneous path, so strip that info out.

Use a columnar delete or Find/Replace process.
Leave a backslash as the first character of each line.
Now my file no longer has that info:

The spreadsheet.

Create your spreadsheet.
Name sheet one “Document Management”.

Label your rows with the meta-data types about the corpus such as; Date of Document, Author, Checked, etc.

This is actually lacking in my data, as all the unique data is within the documents, vice being about the documents.

And label rows as needed about things the documents contain such as; Loss Amount, Gender, Age, Address, etc.

This is where all the information was contained in this project.

Optionally add a “key” column (field) that may match to other datasets.
Finally, create a column that will hold the links to the pdf files themselves.

In my project, I had two columns: Clean (column E) and Annotated (column F):

Name sheet two “Documents”.

Enter “DocumentPathAndName” into cell A1

I have two sets of pdf files that apply to one or more jurors.

I created two columns. One for path and file for the Clean versions (A1) and one for the path and file for the Annotated versions (B1)

Enter “DriveLetter” into cell C1.

Enter the drive letter containing the files into cell D2.

In my project “c:”

Enter “PathToRoot” into cell E1

Enter the path from the drive letter to the root folder into cell E2.

In my project “\temp”

Import FileList.txt or insert the information from FileList,txt into column A, starting in row 2 down.

Either way can be used. We want each line of the file to be in one cell in separate rows.
I did this twice. Once for the Annotated files and once for the Clean files.

This is what my sheet looked like:

Now let’s put a formula in the Document Management sheet to link these records to the appropriate pdf files.

In cell E2 enter “=HYPERLINK(CONCATENATE(Documents!$D$1,Documents!$F$1,Documents!B2))”

You can change what is displayed in the cell by adding a comma between the last two closing parenthesis and populating it like this: “,CONCATENATE("Juror Number ",A2," Clean")”
This will now display “Juror Number 1 Clean” and increment as appropriate.

In my project I redid this for the Annotated versions, with the appropriate changes.
Microsoft gives a warning whenever hyperlinks are clicked in Office programs. This Support KnowledgeBase article discusses how to change that action: support.microsoft.com/kb/925757.

Now that we have our spreadsheet setup and populate the cells with data, we can use (in Excel) the Data/Filter option.

With this setup, we can apply filtering and sorting to any or all of the columns to find specific sets of results. From there it is a simply click on the link to view the actual pdf files.

I know this article is kind of long and the graphics may be a bit small; but hopefully you can see how to implement a quick and dirty document management process can be done using on the File System and a Spreadsheet.

Feel free to pass along any comments or check with me if you have any specific projects this methodology could be used for.

chuck

Wednesday, October 15, 2014

Intricate SQL statement to expose workflow status

Ok, here's a geeky sql statement I had to compose for web page and how it works.

SELECT

dbo.ProductionTicket.JCNumber AS JCNumber,

CONVERT(date, dbo.ProductionTicket.DateReceived) AS Received,

dbo.ProductionTicket.Description AS Description,

'Production' AS Job,

CONVERT(date, dbo.ProductionTicket.DateCompleted) AS DateCompleted,

CASE dbo.ProductionTicket.Closed
WHEN 'False' THEN 'Open'
WHEN 'True' THEN 'Completed' END AS Completed

FROM dbo.ProductionTicket
WHERE (dbo.ProductionTicket.Closed = 'False')
OR
((dbo.ProductionTicket.Closed = 'True')
AND (dbo.ProductionTicket.DateCompleted > GETDATE() - 30))

UNION ALL

SELECT

CAST(dbo.TranscriptProcess.ID AS nvarchar(15)) AS JCNumber,

CONVERT(date, dbo.TranscriptProcess.DTReceived) AS Received,

LTRIM(RTRIM((ISNULL(dbo.TranscriptProcess.NameFirst, '') + ' ' + ISNULL(dbo.TranscriptProcess.NameLast, '')))) AS Description,

'Transcription' AS Job,

CONVERT(date, dbo.TranscriptProcess.DTFromReader) AS DateCompleted,

CASE dbo.TranscriptProcess.JobStatus
WHEN '19 - Returned to Client' THEN 'Completed'
WHEN '20 - Cancelled by Client' THEN 'Cancelled'
ELSE 'Open'
END AS Completed

FROM dbo.TranscriptProcess

WHERE ((dbo.TranscriptProcess.JobStatus = '19 - Returned to Client')
AND (dbo.TranscriptProcess.DTDUE > GETDATE() - 30))
OR
(dbo.TranscriptProcess.JobStatus <> '19 - Returned to Client'
AND dbo.TranscriptProcess.JobStatus <> '20 - Cancelled by Client')

This is designed to return various in-process jobs from two different tables and give a single view result.

So, there are two queries (the two SELECT statements) combined (the UNION ALL statement) to one record set.

Both SELECT statements return six fields (JCNumber, Received, Description, Job, DateCompleted and Completed).

Since the two tables are not identical, there is some data massaging going on prior to the UNION ALL step.

The first SELECT statement fields in order:

Simply applies a specific name to the field JCNumber.
This field needs to be CONVERTed to a Date format and named Received.
This is just named Description
This is just named Job.
Another Date CONVERT and renamed to DateCompleted.
This needs boolean values CASEd to Open and Completed and named Completed.

These records are coming from a table ProductionTicket where (the Closed field is either False) OR (the Closed field is True AND the date is within the last 30 days). So it gets all still open records and any records that have been closed within the last 30 days.

The second SELECT statement fields in order:

First the field has to be CAST as a 15 character nvarchar, then named JCNumber.

Needs to match the field from the first SELECT statement.

This field needs to be CONVERTed to a Date format and named Received.
This field is generated by the concatenation of two NULLable fields (FirstName and LastName) with a space between with any leading (LTRIM) and trailing (RTRIM) spaces removed; and named Description.
This is just named Job.
Another Date CONVERT and named to DateCompleted.
Another CASE process, except this time we are pealing off two specific values and the remaining ones are set to Open; and named Completed.

These records are coming from a table TranscriptProcess where (the JobStatus is returned to client AND the difference between the Due date and now is less than 30) OR (the JobStatus isn't Returned to Client AND isn't Cancelled by Client). So all jobs returned to the client with a Due date within the last 30 days (so we don't show long ago jobs) and all currently open jobs (not returned or cancelled).

Just wanted to show how convoluted SQL queries can be de-composed into easy to understand bits.

Nothing like explaining something to blow the cobwebs from the mind.

I like it.

chuck

Friday, October 10, 2014

More RegEx Recipes for Text Processing

Yesterday I wrote a piece on using some elementary Find-and-Replace stuff using Notepad++ for handling an issue that occasionally arises in transcript handling.

On this post I thought I'd add some real RegEx recipes for more advanced text handling.

(I will be using two programs, the first is the previously mentioned Notepad++. It is a free, as in beer, software that I use extensively. The other is TextPad by Helios Software Solutions. It is only about $30 for a license, but if you need the advanced features, it is WELL worth the price. I will indicate which recipes must use TextPad, otherwise the default is using Notepad++.)

(I will refer to these RegEx processes as recipes, meaning there is a Find portion and a Replace portion. I will use quotes to make reading easier, but the start and end quotes are not meant to be used in the actual recipe. I will indicate a space as <space>, since those tend to be difficult to read on line. For example; Find: "\r\n<space>\r\n". Also I will use "F:" and "R:" instead of using all the extra letters of "Find:" and "Replace:" And obviously, unless noted, the quotes are NOT part of the recipes and are included solely to make it easier to read.)

Alright, first an explanation of the recipe for replacing the erroneous EOL characters from the other blog.

That recipe was F: "\r\r\n" R: "\r\n\r\n".
     Explanation: Find the lonely Carriage Return followed by the correct EOL sequence (\r\r\n) and replace it with the correct EOL sequence - twice (\r\n\r\n).
     Summary: Clean up files with this odd combination of EOLs.

Another version of this issue is a file that ONLY has Line Feed characters at the end of EVERY line.
This recipe would be F: "\n" R: "\r\n"
(This explanation and summary is left as an exercise for the reader.)

The recipe to remove double-spacing is F: "\r\n\r\n" R: "\r\n".
     Explanation: Find two ASCII EOL markers in a sequence (\r\n\r\n) and replace with a single EOL (\r\n).
     Summary: Remove all empty lines, such as found in a double-spaced file.

Trim trailing whitespace F: "[<space>\t]+$" R: ""
     Explanation: Find and select any and all "+" spaces "<space>" or tabs "\t" that occur just before the end of the line "$" and replace them all "[]" with nothing.
     Summary: Trim the end of a line of extraneous whitespace.

Now the follow on to trim leading whitespace F: "^[<space>\t]+" R: ""
     Explanation, same as above, except "^" means the start of a line.

To find most problem non-ANSI characters use F: "[^\xd,\xa,\x20-\x7a]"
     ANSI characters are the printing characters and lower non-printing ones. The "\x20-\x7a" catches the upper problem characters.

Thanks and have a great day.

chuck

Saturday, October 4, 2014

Hello World!

Welcome to the my new blog.

I already write one for the firm I work for, Visionary Legal Technologies, as an advanced topic support and software oriented blog.

This one is for non-Visionary oriented software posts.

The title is from a t-shirt I was given as a gift that provides a quick phrase.

I may occasionally refer to posts on this site from the other, Visionary Tech, to allow me to write more in-depth pieces. I do not expect there to be much duplication in content.

With that said, I'm off; follow along as you wish.

chuck

"I can explain it to you, but I cannot make you understand."