Filtering Invalid Characters, Notrix

I have two databases. A main one with thousands of documents. In these documents are a few parrallel multi-valued fields. Namely an ID field and a Title which represents a document in the second database.

Unfortunately the Titles for the documents in the second database have been entered manually from random sources. Im pretty sure one source is MS Word. Looks like people copy and pasted text from Word into a Notes field. So that is causing a lot of carriage returns, tabs, and those weird quotes that Word uses.

My problem occurs when I try to move this data over to DB2. The table in each document is moved over to rows in DB2 via Notrix and the Notrix job that does this fails because DB2 can’t accept those special/invalid characters in a column.

Im trying to come up with the best way to solve this.

The second database that has the documents for the ID/Titles has been manually cleaned. Well probably. I havent checked this myself. But its said to be clean now.

So what someone wants to do it just refresh the several thousand documents by

  1. going through each Title/ID doc in the second database and checking every char to be sure its between ascii code 31 and 128. If its not add the ID to a list

  2. go throuh every document in the main database and checking to see if that document contains any of the IDs in the “Has a bad char” list.

  3. Finally take that list of bad docs and refresh all the titles in the doc

I was thinking isnt there an easy way to validate the fields in the Notrix Job? But due to the volume I dont want to do some giant @ReplaceSubString on every single document. Im sure that would make this nightly Notrix job crawl a good deal.

Im all for cleaning the data, but some preventitive maintanence sounds required. The procedure for entering these titles hasnt changed so I dont see any reason why people wont continue to copy and paste from word.

Maybe a weekly agent or notrix job to scroll through the Title database and check for invalid characters, make a list of updated ones. Then a weekly agent on the main database to go through the documents and see if they have any of the corresponding updated Titles and update them?

I dunno just would love some input here.