FreeLing 3.0
Class tokenizer implements a token splitter, which converts a string into a sequence of word objects according to a set of tokenization rules read from a configuration file.
#include <tokenizer.h>
Public Member Functions | |
| tokenizer (const std::wstring &) | |
| Constructor. | |
| void | tokenize (const std::wstring &, std::list< word > &) |
| tokenize string | |
| std::list< word > | tokenize (const std::wstring &) |
| tokenize string, return result as list | |
| void | tokenize (const std::wstring &, unsigned long &, std::list< word > &) |
| tokenize string, tracking offset | |
| std::list< word > | tokenize (const std::wstring &, unsigned long &) |
| tokenize string, tracking offset, return result as list | |
Private Attributes | |
| std::set< std::wstring > | abrevs |
| Set of abbreviations (Dr., Mrs., etc.) whose final period is not separated | |
| std::list< std::pair < std::wstring, boost::u32regex > > | rules |
| tokenization rules | |
| std::map< std::wstring, int > | matches |
| substrings to convert into tokens in each rule | |
Class tokenizer implements a token splitter, which converts a string into a sequence of word objects according to a set of tokenization rules read from a configuration file.
tokenizer::tokenizer (const std::wstring & tokFile)
Constructor.
Creates a tokenizer using the abbreviations and tokenization patterns read from the given file (tokFile).
References ERROR_CRASH, util::open_utf8_file(), and TRACE.
void tokenizer::tokenize (const std::wstring &, std::list< word > &)
tokenize string
std::list<word> tokenizer::tokenize (const std::wstring &)
tokenize string, return result as list
void tokenizer::tokenize (const std::wstring &, unsigned long &, std::list< word > &)
tokenize string, tracking offset
std::list<word> tokenizer::tokenize (const std::wstring &, unsigned long &)
tokenize string, tracking offset, return result as list
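The offset-tracking overloads take the offset by non-const reference, which suggests that one counter can be carried across successive calls so that token positions stay in a single coordinate system. The sketch below illustrates that calling pattern only. It is not FreeLing code: `Token`, `toy_tokenize` and `offset_demo` are hypothetical stand-ins, the splitter just breaks on whitespace, and the assumption that the offset advances by every character consumed is ours, not something this page states.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <list>
#include <string>

// Hypothetical stand-in for FreeLing's word class: a form plus its span.
struct Token {
  std::wstring form;
  unsigned long start;   // offset of the first character
  unsigned long finish;  // offset one past the last character
};

// Toy whitespace splitter with the same shape as the offset-tracking overload.
// Assumption: `offset` is advanced by every character consumed.
void toy_tokenize(const std::wstring &text, unsigned long &offset,
                  std::list<Token> &out) {
  std::size_t i = 0;
  while (i < text.size()) {
    // Skip whitespace, counting it in the running offset.
    while (i < text.size() && std::iswspace(text[i])) { ++i; ++offset; }
    const std::size_t begin = i;
    while (i < text.size() && !std::iswspace(text[i])) ++i;
    if (i > begin) {
      const unsigned long len = static_cast<unsigned long>(i - begin);
      out.push_back(Token{text.substr(begin, len), offset, offset + len});
      offset += len;
    }
  }
}

// Convenience overload returning the list, mirroring the by-value variant.
std::list<Token> toy_tokenize(const std::wstring &text, unsigned long &offset) {
  std::list<Token> out;
  toy_tokenize(text, offset, out);
  return out;
}

// Usage sketch: one offset variable shared by two consecutive calls.
inline void offset_demo() {
  unsigned long offset = 0;
  std::list<Token> first = toy_tokenize(L"Hello world ", offset);
  std::list<Token> second = toy_tokenize(L"again", offset);
  assert(first.size() == 2);
  assert(second.front().start == 12);  // continues where the first call ended
}
```

The out-parameter form appends to a caller-owned list, which avoids a copy on older compilers; the by-value form is more convenient when the result is used immediately.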
std::set<std::wstring> tokenizer::abrevs [private]
Set of abbreviations (Dr., Mrs., etc.) whose final period is not separated
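To make the role of the abbreviation set concrete, here is a minimal sketch of the kind of check it enables: a trailing period is split off as its own token unless the whole token is a known abbreviation. The function `split_final_period` is hypothetical, and the sketch assumes the set stores abbreviations with their period and compares them case-sensitively; the actual FreeLing lookup may normalize differently.

```cpp
#include <cassert>
#include <list>
#include <set>
#include <string>

// Hypothetical helper: keep "Dr." whole, but split "home." into "home" and ".".
// Assumption: abbreviations are stored including the final period.
std::list<std::wstring> split_final_period(const std::wstring &tok,
                                           const std::set<std::wstring> &abrevs) {
  std::list<std::wstring> out;
  const bool ends_with_period = tok.size() > 1 && tok[tok.size() - 1] == L'.';
  if (ends_with_period && abrevs.count(tok) == 0) {
    out.push_back(tok.substr(0, tok.size() - 1));  // the word itself
    out.push_back(L".");                           // the sentence-final period
  } else {
    out.push_back(tok);                            // abbreviation or no period
  }
  return out;
}

// Usage sketch with two abbreviations taken from the description above.
inline void abbreviation_demo() {
  std::set<std::wstring> abrevs;
  abrevs.insert(L"Dr.");
  abrevs.insert(L"Mrs.");
  assert(split_final_period(L"Dr.", abrevs).size() == 1);
  assert(split_final_period(L"home.", abrevs).size() == 2);
}
```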
std::map<std::wstring,int> tokenizer::matches [private]
substrings to convert into tokens in each rule
std::list<std::pair<std::wstring,boost::u32regex> > tokenizer::rules [private]
tokenization rules
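The types of `rules` (an ordered list of name/regex pairs) and `matches` (a map from rule name to an integer) suggest a first-match-wins loop over the rules at each input position. The sketch below shows one way such a loop could look. It is an illustration under assumptions, not the FreeLing implementation: `std::wregex` stands in for `boost::u32regex`, `apply_rules` and `RuleList` are hypothetical names, a `matches` value of 0 is assumed to select the whole match and a value n > 0 to select sub-expressions 1 to n, and unmatched characters become single-character tokens.

```cpp
#include <cassert>
#include <cwctype>
#include <list>
#include <map>
#include <regex>
#include <string>
#include <utility>

// Hypothetical alias; std::wregex stands in for boost::u32regex.
typedef std::list<std::pair<std::wstring, std::wregex> > RuleList;

std::list<std::wstring> apply_rules(const std::wstring &text,
                                    const RuleList &rules,
                                    const std::map<std::wstring, int> &matches) {
  std::list<std::wstring> tokens;
  std::wstring::const_iterator pos = text.begin();
  while (pos != text.end()) {
    if (std::iswspace(*pos)) { ++pos; continue; }  // whitespace separates tokens
    bool matched = false;
    for (RuleList::const_iterator r = rules.begin();
         r != rules.end() && !matched; ++r) {
      std::wsmatch m;
      // match_continuous anchors the rule at the current position.
      if (std::regex_search(pos, text.end(), m, r->second,
                            std::regex_constants::match_continuous) &&
          m.length(0) > 0) {
        std::map<std::wstring, int>::const_iterator it = matches.find(r->first);
        const int n = (it == matches.end()) ? 0 : it->second;
        if (n == 0) {
          tokens.push_back(m.str(0));              // whole match is one token
        } else {
          for (int i = 1; i <= n; ++i) tokens.push_back(m.str(i));
        }
        pos += m.length(0);                        // skip past the whole match
        matched = true;
      }
    }
    if (!matched) {                                // fallback: one character
      tokens.push_back(std::wstring(1, *pos));
      ++pos;
    }
  }
  return tokens;
}

// Usage sketch: two made-up rules tried in list order.
inline void rules_demo() {
  RuleList rules;
  rules.push_back(std::make_pair(std::wstring(L"WORD"), std::wregex(L"[A-Za-z]+")));
  rules.push_back(std::make_pair(std::wstring(L"NUMBER"), std::wregex(L"[0-9]+")));
  std::map<std::wstring, int> matches;
  matches[L"WORD"] = 0;
  matches[L"NUMBER"] = 0;
  assert(apply_rules(L"abc 42!", rules, matches).size() == 3);
}
```

Because the rules are kept in a list rather than a map, their order is preserved, so more specific patterns can be placed before general ones.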