I have a problem with EXT_SORT: it does not sort correctly when the sortKey value is repeated across records. Can it be changed so that records with duplicate key values are sorted properly?
DELIMITED_DATA_READER -> EXT_SORT -> DELIMITED_DATA_WRITER
<Node type="EXT_SORT" sortOrder="A" id="EXT_SORT_0" SorterInitialCapacity="140000" numberOfTapes="6" sortKey="Field0" />
<Node append="false" type="DELIMITED_DATA_WRITER" charset="UTF-8" id="DELIMITED_DATA_WRITER_0" fileURL="c:/test/target_sort.txt" />
<Node type="DELIMITED_DATA_READER" DataPolicy="Lenient" charset="UTF-8" id="DELIMITED_DATA_READER_0" numRecords="-1" skipRows="0" fileURL="c:/test/source.txt" />
<Edge metadata="EXT_SORT_0_0__MetaData" toNode="EXT_SORT_0:0" fromNode="DELIMITED_DATA_READER_0:0" />
<Edge metadata="DELIMITED_DATA_WRITER_0_0__MetaData" toNode="DELIMITED_DATA_WRITER_0:0" fromNode="EXT_SORT_0:0" />
source.txt has two fields, Field0 and Field1. I sort on Field0, but Field0 contains duplicate values, such as 0000000.
The contents of source.txt:
0000000;005
0000001;006
0000000;007
0000002;004
The actual output in target_sort.txt:
0000000;007
0000001;006
0000002;004
0000000;005
The result I expect:
0000000;005
0000000;007
0000001;006
0000002;004
-
I tried version 2.5.0, and this issue has been resolved :)
1) However, I found other problems. Metadata does not support field names that contain Unicode (non-ASCII) characters:
public void setName(String _name) {
    if (!StringUtils.isValidObjectName(_name)) {
        throw new InvalidGraphObjectNameException(_name, "FIELD");
    }
    this.name = _name;
}
private final static String OBJECT_NAME_PATTERN = "[_A-Za-z]+[_A-Za-z0-9]*";
2) Access databases do not support the setHoldability method:
connection.setHoldability(ResultSet.CLOSE_CURSORS_AT_COMMIT); -
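To illustrate item 1, here is a minimal sketch of how the quoted OBJECT_NAME_PATTERN treats an ASCII name and a non-ASCII name. It assumes that isValidObjectName does a full match against that pattern, which the post does not show. The class FieldNameCheck and the UNICODE_NAME pattern are hypothetical and are not part of CloverETL; the second pattern only shows one way such names could be accepted.

```java
import java.util.regex.Pattern;

public class FieldNameCheck {
    // Pattern quoted in the post: ASCII letters, digits and underscore only.
    static final Pattern ASCII_NAME = Pattern.compile("[_A-Za-z]+[_A-Za-z0-9]*");

    // Hypothetical Unicode-aware alternative (an assumption, not CloverETL code):
    // any Unicode letter or underscore first, then letters, decimal digits, underscore.
    static final Pattern UNICODE_NAME = Pattern.compile("[_\\p{L}]+[_\\p{L}\\p{Nd}]*");

    // Assumed behaviour of isValidObjectName: the whole name must match.
    static boolean isAsciiName(String name) {
        return ASCII_NAME.matcher(name).matches();
    }

    static boolean isUnicodeName(String name) {
        return UNICODE_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // "\u5b57\u6bb50" is a two-character Chinese word followed by the digit 0.
        String[] names = {"Field0", "\u5b57\u6bb50", "0Field"};
        for (String name : names) {
            System.out.println(name
                    + " ascii=" + isAsciiName(name)
                    + " unicode=" + isUnicodeName(name));
        }
    }
}
```

With the quoted pattern the Chinese field name is rejected, which matches the InvalidGraphObjectNameException described in item 1.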
Sorry, the DELIMITED_DATA_READER -> EXT_SORT -> DELIMITED_DATA_WRITER problem has not been resolved after all.
===========================================
I found that the problem is in how DELIMITED_DATA_READER reads UTF-8 encoded files. The source file is encoded in UTF-8 and contains, for example, 0000000;007
I ran a test:
DELIMITED_DATA_READER (source file encoded in UTF-8) -> DELIMITED_DATA_WRITER (target file encoded in GB2312)
The output contains an extra question mark:
UNMAPPABLE[1] when converting to GB2312: '?0000000;007'
=============================================
I think the problem may be in how the data is decoded when it is read for EXT_SORT. -
Isn't there an "invisible" character before the record? I can't reproduce the problem. -
I have sent the relevant files to your mailbox; please take a look. -
Hi,
there are some "invisible" characters in your source file, and they cause both problems. To make the graph work properly, you need to put a Reformat component after the reader. The transformation removes the characters that cause the problem and should look like:
function transform() {
    $0.Field0 := replace($0.Field0, '[^\p{ASCII}]*', '');
    $0.Field1 := $0.Field1;
} -
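For readers who want to see why one stray character breaks the sort, here is a minimal Java sketch. It assumes the "invisible" character is a UTF-8 byte order mark (U+FEFF) at the start of the first record, which the thread does not confirm but which would be consistent with a file saved from Notepad. The class BomSortDemo and its methods are hypothetical; the regular expression is the same one used in the transformation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BomSortDemo {
    // Same regular expression as in the Reformat transformation.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\p{ASCII}]*", "");
    }

    // The sort key is everything before the first ';' delimiter.
    static String key(String record) {
        return record.substring(0, record.indexOf(';'));
    }

    // Cleans only the key field and leaves the rest of the record unchanged.
    static List<String> cleanKeys(List<String> records) {
        List<String> out = new ArrayList<>();
        for (String r : records) {
            out.add(stripNonAscii(key(r)) + r.substring(r.indexOf(';')));
        }
        return out;
    }

    // Ascending sort on the key; List.sort is stable, so equal keys keep input order.
    static List<String> sortByKey(List<String> records) {
        List<String> out = new ArrayList<>(records);
        out.sort(Comparator.comparing((String r) -> key(r)));
        return out;
    }

    public static void main(String[] args) {
        // Assumption: the first record starts with U+FEFF (byte order mark).
        List<String> source = Arrays.asList(
                "\uFEFF0000000;005", "0000001;006", "0000000;007", "0000002;004");

        // The prefixed key compares greater than every digit-only key,
        // so the first record ends up last.
        System.out.println(sortByKey(source));

        // After cleaning, both 0000000 records sort together at the top.
        System.out.println(sortByKey(cleanKeys(source)));
    }
}
```

Under that assumption, the uncleaned sort puts 0000000;005 last, which is the order seen in target_sort.txt, and the cleaned sort gives the expected order.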
The original file was created in Windows Notepad by using Save As with UTF-8 encoding. I think this handling should be an attribute of the DELIMITED_DATA_READER node.
The attribute would work like the trim function and remove all "invisible" characters, for example:
<DELIMITED_DATA_READER removeInvisibleCharacters="true"/>
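As a rough idea of what such a reader-side option could do, here is a minimal Java sketch that skips a leading byte order mark (U+FEFF) before any record is parsed. It assumes the invisible character is a BOM at the very start of the file. The class BomSkippingReader and its methods are hypothetical, and removeInvisibleCharacters is only the attribute name proposed in this post, not an existing CloverETL setting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class BomSkippingReader {
    // Wraps the reader and consumes one leading U+FEFF character if present.
    static BufferedReader skipBom(Reader in) {
        try {
            BufferedReader reader = new BufferedReader(in);
            reader.mark(1);
            int first = reader.read();
            // Put the character back unless it was a BOM or the stream is empty.
            if (first != 0xFEFF && first != -1) {
                reader.reset();
            }
            return reader;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper for the example: first line with any leading BOM skipped.
    static String firstLine(Reader in) {
        try {
            return skipBom(in).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated decoded file content; Java's UTF-8 decoder keeps the BOM as U+FEFF.
        String content = "\uFEFF0000000;005\n0000001;006\n";
        System.out.println(firstLine(new StringReader(content)));
    }
}
```

This only handles a mark at the start of the input; characters elsewhere in a record would still need something like the Reformat transformation.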