Customer Portal

Preprocessing a file to remove invalid characters

Comments 3

  • Avatar
    slechtaj
    0
    Comment actions Permalink
    Hi Phil,

    What you can do is to read the whole records into one field and then parse the record in another component (Reformat). In other words, the output edge connected to the reader should contain just one field (the whole record with field delimiters).
    The process might look like as follows:

    • Read rows from gzipped CSV files.
      [list:1dz88vme]
      [*:1dz88vme]You might use appropriate URL format so you don't have to extract the files.

    • Make sure the output metadata has just one field with record delimiter – this way you will read the whole record into one field.
    [/*:m:1dz88vme]
  • Attach the reader's output edge to Reformat component.

  • Using Reformat's Transform attribute parse the record the way you want.
      [*:1dz88vme]Here you can call replace() function to get rid of all unwanted character combinations. Please note the replace() function works basically the same way as Java replaceAll() method.
  • [/*:m:1dz88vme][/list:o:1dz88vme]
  • Avatar
    proberge
    0
    Comment actions Permalink
    Jan,

    Thanks for the response. I do use the gzip:() form of the file URL in the UniversalDataReader.

    I have contemplated what you suggest, but once I have done the reformat/replace to eliminate the invalid characters on entire records, I need to parse each (CSV) record into the various fields of the final record. Do I need to do a READ->REFORMAT->WRITE(to file) followed by a READ(corrected records, parsing CSV)->...balance of processing..., or is there a function/component where I can do the CSV parse of the corrected records. I don't see that Reformat will do that unless I use something like split(), which is not really intended to parse a CSV.

    Thanks,
    -Phil
  • Avatar
    slechtaj
    0
    Comment actions Permalink
    Hi Phil,

    You have basically two options:

    • First is what you have already mentioned - parsing the record manually using the Reformat transformation.

    • The second option, which is from my point of view better solution, is sending the cleaned records to UniversalDataReader directly from the Reformat. You don't have to write the data to an output file, since all readers can read data from the input port. In this case, the File URL attribute of the UniversalDataReader would have the following value:
      port:$0.content:discrete


    You might find some useful information in Clover documentation under the following links:


    I have also created an example that might help you better understand the solution.

    ParseData.zip

Please sign in to leave a comment.