netkit.graph.io
Class SplitParser

java.lang.Object
  extended by netkit.graph.io.SplitParser

public abstract class SplitParser
extends java.lang.Object

This class parses lines of text using regular expression patterns that must match an entire line. The regex is expected to contain capturing groups, with at least one such group for each expected field in the line. You can either use the supplied parsers and their respective patterns, or subclass this class and define your own patterns and, if necessary, your own method for extracting the captured groups. This class aims to be faster, and to create less garbage, than using Pattern.split() on large input files.

Author:
Kaveh R. Ghazi
See Also:
Pattern, Matcher

Field Summary
protected  java.lang.String[] fields
          A String array used for returning the parsed input.
protected  java.util.regex.Matcher matcher
          The Matcher object used to parse each line.
 
Constructor Summary
protected SplitParser(int fieldNum, java.lang.String patternStart, java.lang.String patternMiddle, java.lang.String patternEnd)
          The protected constructor which creates the field holder array and assembles the regular expression for matching lines.
 
Method Summary
static SplitParser getParserCOMMA(int fieldNum)
          Gets a parser that parses lines containing comma separated values, no whitespace allowed.
static SplitParser getParserCOMMAWS(int fieldNum)
          Gets a parser that parses lines containing comma separated values, possibly surrounded with whitespace; whitespace is ignored/removed.
static SplitParser getParserCSV(int fieldNum)
          Gets a parser that parses lines containing comma separated values possibly wrapped with double quotes, possibly surrounded with whitespace.
static SplitParser getParserWS(int fieldNum)
          Gets a parser that parses lines containing whitespace separated values, with arbitrary extra whitespace between values.
static SplitParser getParserWS1(int fieldNum)
          Gets a parser that parses lines containing whitespace separated values, where each separator is exactly one character and no extra whitespace appears anywhere in the supplied lines.
 java.lang.String getRegex()
          Gets a String representation of the regular expression defined for the Matcher object used on each line.
static void main(java.lang.String[] args)
           
 java.lang.String[] parseLine(java.lang.CharSequence line)
          Parses a line of text according to the regex defined for this parser and splits it into an array of String.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

fields

protected final java.lang.String[] fields
A String array used for returning the parsed input. The captured regex groups are placed into this array. This array is reused for each line to avoid creating excess garbage for the collector, so clients must extract the elements and store them elsewhere before parsing the next line. Access to this field is made available to subclasses which elect to override the default parseLine(CharSequence) method.
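The reuse of this array is easy to overlook, so here is a minimal, self-contained sketch of the hazard and the usual remedy. ReusedArraySketch is a hypothetical stand-in written for illustration, not the netkit class, and its two-field comma regex is an assumption. It follows the behaviour described above: one Matcher and one String array are allocated once and reused for every line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a parser that reuses its result array.
final class ReusedArraySketch {
    // Assumed regex for two comma-separated fields; illustration only.
    private final Matcher matcher = Pattern.compile("([^,]*),([^,]*)").matcher("");
    private final String[] fields = new String[2];

    // Returns the SAME array on every call, overwritten with the new line's fields.
    String[] parseLine(CharSequence line) {
        if (!matcher.reset(line).matches()) {
            throw new RuntimeException("Line does not match: " + line);
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = matcher.group(i + 1); // group 0 is the whole match
        }
        return fields;
    }

    // Copies each result before parsing the next line, so earlier rows survive.
    static List<String[]> readAll(List<String> lines) {
        ReusedArraySketch parser = new ReusedArraySketch();
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(parser.parseLine(line).clone());
        }
        return rows;
    }

    public static void main(String[] args) {
        ReusedArraySketch parser = new ReusedArraySketch();
        String[] first = parser.parseLine("a,b");
        String[] second = parser.parseLine("c,d");
        System.out.println(first == second); // true: both names refer to one array
        System.out.println(first[0]);        // c: the first line's data is gone
    }
}
```

With the real class, the same remedy would be to call clone() on the array returned by parseLine(CharSequence), or to copy out the individual elements, before parsing the next line.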


matcher

protected final java.util.regex.Matcher matcher
The Matcher object used to parse each line. It contains the Pattern and regex used for matching. Access to this field is made available to subclasses which elect to override the default parseLine(CharSequence) method.

Constructor Detail

SplitParser

protected SplitParser(int fieldNum,
                      java.lang.String patternStart,
                      java.lang.String patternMiddle,
                      java.lang.String patternEnd)
The protected constructor which creates the field holder array and assembles the regular expression for matching lines. Subclasses must pass it the number of fields expected in each line and the pieces from which the regex is constructed.

Parameters:
fieldNum - an int specifying the number of fields in each line of parsed text.
patternStart - a String which begins the regex pattern; it normally contains the capturing group for the first field.
patternMiddle - a String which is appended fieldNum-1 times after the patternStart in the regex pattern; it normally contains a capturing group used for any fields after the first.
patternEnd - a String which is appended to the end of the regex pattern; it normally contains any regex suffix necessary to terminate the expression.
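
The assembly rule described by these parameters can be sketched as follows. RegexAssemblySketch and its assemble method are hypothetical names used for illustration, and the comma-separated pattern pieces are assumptions; the pieces the supplied parsers actually pass to the constructor are not shown in this documentation.

```java
// Hypothetical helper mirroring the documented rule:
// patternStart, then patternMiddle repeated fieldNum-1 times, then patternEnd.
final class RegexAssemblySketch {
    static String assemble(int fieldNum, String patternStart,
                           String patternMiddle, String patternEnd) {
        if (fieldNum < 1) {
            throw new IllegalArgumentException("fieldNum must be at least 1");
        }
        StringBuilder regex = new StringBuilder(patternStart);
        for (int i = 1; i < fieldNum; i++) {
            regex.append(patternMiddle);
        }
        return regex.append(patternEnd).toString();
    }

    public static void main(String[] args) {
        // Assumed pieces for three comma-separated fields; illustration only.
        String regex = assemble(3, "([^,]*)", ",([^,]*)", "");
        System.out.println(regex); // ([^,]*),([^,]*),([^,]*)
    }
}
```

Each piece contributes one capturing group, so the assembled expression has one group per field, which is what the default parseLine(CharSequence) expects.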
Method Detail

parseLine

public java.lang.String[] parseLine(java.lang.CharSequence line)
Parses a line of text according to the regex defined for this parser and splits it into an array of String. This default implementation assumes the regex captures exactly one group per expected field in the supplied line of text. If the pattern has multiple regex groups per field, override this method and extract the groups accordingly.

Parameters:
line - a CharSequence containing the line of text to split.
Returns:
an array of String containing the matching fields from the supplied line of text; this array is reused for each call to this method.
Throws:
java.lang.RuntimeException - if the supplied line doesn't match this parser's regex.
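
The case of multiple groups per field can be sketched as follows. QuotedFieldSketch is a hypothetical, self-contained class rather than a real SplitParser subclass, and its per-field pattern (a quoted alternative and an unquoted alternative, each with its own capturing group) is an assumption made for illustration. The point is the extraction loop: with two groups per field, field i is read from whichever of its two groups took part in the match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser whose regex has TWO capturing groups per field.
final class QuotedFieldSketch {
    // Assumed per-field pattern: a double-quoted value (first group)
    // or an unquoted value containing no commas or quotes (second group).
    private static final String FIELD = "(?:\"([^\"]*)\"|([^\",]*))";

    private final String[] fields;
    private final Matcher matcher;

    QuotedFieldSketch(int fieldNum) {
        StringBuilder regex = new StringBuilder(FIELD);
        for (int i = 1; i < fieldNum; i++) {
            regex.append(',').append(FIELD);
        }
        fields = new String[fieldNum];
        matcher = Pattern.compile(regex.toString()).matcher("");
    }

    String[] parseLine(CharSequence line) {
        if (!matcher.reset(line).matches()) {
            throw new RuntimeException("Line does not match: " + line);
        }
        for (int i = 0; i < fields.length; i++) {
            // Groups are numbered from 1; field i owns groups 2i+1 and 2i+2.
            String quoted = matcher.group(2 * i + 1);
            fields[i] = (quoted != null) ? quoted : matcher.group(2 * i + 2);
        }
        return fields;
    }

    public static void main(String[] args) {
        QuotedFieldSketch parser = new QuotedFieldSketch(2);
        System.out.println(Arrays.toString(parser.parseLine("\"a b\",c"))); // [a b, c]
    }
}
```

In a real subclass the pattern pieces would instead be passed to the protected constructor, and the overriding parseLine(CharSequence) would use the inherited matcher and fields members.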

getRegex

public final java.lang.String getRegex()
Gets a String representation of the regular expression defined for the Matcher object used on each line. The expression will be specific to the number of fields defined for this parser object. This is mainly used to verify the constructed expression if desired.

Returns:
a String representation of the regular expression defined for the Matcher object used on each line.

getParserWS1

public static final SplitParser getParserWS1(int fieldNum)
Gets a parser that parses lines containing whitespace separated values, where each separator is exactly one character and no extra whitespace appears anywhere in the supplied lines.

Parameters:
fieldNum - the number of fields expected per line.
Returns:
a parser that parses lines containing whitespace separated values.

getParserWS

public static final SplitParser getParserWS(int fieldNum)
Gets a parser that parses lines containing whitespace separated values, with arbitrary extra whitespace between values.

Parameters:
fieldNum - the number of fields expected per line.
Returns:
a parser that parses lines containing whitespace separated values, with arbitrary extra whitespace between values.
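
The difference between getParserWS1 and getParserWS can be illustrated with two plain java.util.regex patterns. The regexes below are assumptions chosen to match the descriptions above for three fields; they are not taken from the netkit source, and WhitespacePatternSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

final class WhitespacePatternSketch {
    // Assumed equivalent of the WS1 style: exactly one whitespace character per separator.
    static final Pattern SINGLE = Pattern.compile("(\\S+)\\s(\\S+)\\s(\\S+)");
    // Assumed equivalent of the WS style: one or more whitespace characters per separator.
    static final Pattern ANY = Pattern.compile("(\\S+)\\s+(\\S+)\\s+(\\S+)");

    // True only if the pattern matches the entire line.
    static boolean accepts(Pattern pattern, CharSequence line) {
        return pattern.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(SINGLE, "a b c"));   // true
        System.out.println(accepts(SINGLE, "a  b c"));  // false: two spaces after "a"
        System.out.println(accepts(ANY, "a  b\tc"));    // true
    }
}
```

Under these assumptions the stricter pattern leaves the regex engine less to try per line, which is a plausible reason to prefer it when the input is known to be cleanly formatted.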

getParserCOMMA

public static final SplitParser getParserCOMMA(int fieldNum)
Gets a parser that parses lines containing comma separated values, no whitespace allowed.

Parameters:
fieldNum - the number of fields expected per line.
Returns:
a parser that parses lines containing comma separated values, no whitespace allowed.

getParserCOMMAWS

public static final SplitParser getParserCOMMAWS(int fieldNum)
Gets a parser that parses lines containing comma separated values, possibly surrounded with whitespace; whitespace is ignored/removed.

Parameters:
fieldNum - the number of fields expected per line.
Returns:
a parser that parses lines containing comma separated values.

getParserCSV

public static final SplitParser getParserCSV(int fieldNum)
Gets a parser that parses lines containing comma separated values possibly wrapped with double quotes, possibly surrounded with whitespace. This allows whitespace within the values and follows the CSV format more closely than the other comma parsers.

Parameters:
fieldNum - the number of fields expected per line.
Returns:
a parser that parses lines containing comma separated values possibly wrapped with double quotes, possibly surrounded with whitespace.

main

public static void main(java.lang.String[] args)