Preprocessing a file to remove invalid characters

CloverDX Customer Portal
Forum
Preprocessing a file to remove invalid characters

proberge

November 17, 2015 00:00

Answered

I am attempting to build a graph that does some fairly simple operations (this will be a pattern for a number of graphs):

Remove a set of invalid characters from a gzip'ed data file (CSV)

Read/parse the CSV data

Add fields to the data stream using Reformat

Write the data stream to a database table

I'm stuck at 1. It seems that the only ways to do this are a) employ a system command and use sed, perl, etc to create a fixed version of the file, then read the file normally, or b) read the CSV then iterate through all of the fields, removing the invalid characters from each.

I'm not wildly crazy about either. What I'd like to do is the following, but I don't see a combination of components and/or functions that will allow me to do it:

Read the file s.t. each line is 1 record

Remove invalid characters from each record with replace()

Parse each record as a CSV into the necessary fields

Do the balance of the transformation...

Perhaps I'm making this too hard and should just accept having to do the preprocessing using a shell command.

Thanks in advance,
-Phil

Comments 3

slechtaj

November 18, 2015 16:35
0

Comment actions Permalink
Hi Phil,

What you can do is to read the whole records into one field and then parse the record in another component (Reformat). In other words, the output edge connected to the reader should contain just one field (the whole record with field delimiters).
The process might look like as follows:
- Read rows from gzipped CSV files.
  [list:1dz88vme]
  [*:1dz88vme]You might use appropriate URL format so you don't have to extract the files.
- Make sure the output metadata has just one field with record delimiter – this way you will read the whole record into one field.
[/*:m:1dz88vme]
Attach the reader's output edge to Reformat component.
Using Reformat's Transform attribute parse the record the way you want.
[*:1dz88vme]Here you can call replace() function to get rid of all unwanted character combinations. Please note the replace() function works basically the same way as Java replaceAll() method.
[/*:m:1dz88vme][/list:o:1dz88vme]
proberge

November 18, 2015 17:06
0

Comment actions Permalink
Jan,

Thanks for the response. I do use the gzip:() form of the file URL in the UniversalDataReader.

I have contemplated what you suggest, but once I have done the reformat/replace to eliminate the invalid characters on entire records, I need to parse each (CSV) record into the various fields of the final record. Do I need to do a READ->REFORMAT->WRITE(to file) followed by a READ(corrected records, parsing CSV)->...balance of processing..., or is there a function/component where I can do the CSV parse of the corrected records. I don't see that Reformat will do that unless I use something like split(), which is not really intended to parse a CSV.

Thanks,
-Phil
slechtaj

November 20, 2015 19:29
0

Comment actions Permalink
Hi Phil,

You have basically two options:
- First is what you have already mentioned - parsing the record manually using the Reformat transformation.
- The second option, which is from my point of view better solution, is sending the cleaned records to UniversalDataReader directly from the Reformat. You don't have to write the data to an output file, since all readers can read data from the input port. In this case, the File URL attribute of the UniversalDataReader would have the following value:
  port:$0.content:discrete
You might find some useful information in Clover documentation under the following links:
- Input port reading - pay a special attention to processing types
- UniversalDataReader
I have also created an example that might help you better understand the solution.

ParseData.zip
- ParseData.zip

Please sign in to leave a comment.

Quick links

Access my products

SUPPORT & SERVICES

Community

RESOURCES