Moodle
2.2.1
http://www.collinsharper.com
Namespaces | |
| namespace | moodlecore |
Enumerations | |
| enum | MINIMUM_WORD_SIZE |
| Minimum word size to index and search. | |
| enum | MAXIMUM_WORD_SIZE |
| Maximum word size to index and search. | |
| enum | START_DELIM |
| enum | CENTER_DELIM |
| enum | END_DELIM |
| enum | PREG_CLASS_SEARCH_EXCLUDE |
| enum | PREG_CLASS_NUMBERS |
| enum | PREG_CLASS_PUNCTUATION |
| enum | PREG_CLASS_CJK |
Functions | |
| tokenise_text ($text, $stop_words=array(), $overlap_cjk=false, $join_numbers=false) | |
| tokenise_split ($text, $stop_words, $overlap_cjk, $join_numbers) | |
| tokenise_simplify ($text, $overlap_cjk, $join_numbers) | |
| tokenise_expand_cjk ($matches) | |
| tokenise_truncate_word (&$text) | |
| enum CENTER_DELIM |
Definition at line 60 of file tokeniserlib.php.
| enum END_DELIM |
Definition at line 61 of file tokeniserlib.php.
| enum MAXIMUM_WORD_SIZE |
Maximum word size to index and search.
Definition at line 57 of file tokeniserlib.php.
| enum MINIMUM_WORD_SIZE |
Minimum word size to index and search.
Definition at line 56 of file tokeniserlib.php.
| enum PREG_CLASS_CJK |
Matches all CJK characters that are candidates for auto-splitting (Chinese, Japanese, Korean). Contains kana and BMP ideographs.
Definition at line 141 of file tokeniserlib.php.
| enum PREG_CLASS_NUMBERS |
Matches all 'N' Unicode character classes (numbers).
Definition at line 107 of file tokeniserlib.php.
| enum PREG_CLASS_PUNCTUATION |
Matches all 'P' Unicode character classes (punctuation).
Definition at line 120 of file tokeniserlib.php.
| enum PREG_CLASS_SEARCH_EXCLUDE |
Matches Unicode character classes to exclude from the search index.
See: http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
The index only contains the following character classes:
| Lu | Letter, Uppercase |
| Ll | Letter, Lowercase |
| Lt | Letter, Titlecase |
| Lo | Letter, Other |
| Nd | Number, Decimal Digit |
| No | Number, Other |
Definition at line 76 of file tokeniserlib.php.
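The effect of restricting the index to those six categories can be sketched as follows. This is only an illustration under stated assumptions: `SKETCH_KEEP_CLASS` and `sketch_strip_excluded()` are hypothetical names, and the pattern is a negated class built from the listed categories rather than the library's actual `PREG_CLASS_SEARCH_EXCLUDE` value.

```php
<?php
// Assumption: a character class built from the six categories kept in the index.
// The library's real PREG_CLASS_SEARCH_EXCLUDE value is not reproduced here.
const SKETCH_KEEP_CLASS = '\p{Lu}\p{Ll}\p{Lt}\p{Lo}\p{Nd}\p{No}';

/**
 * Hypothetical helper: replaces every run of characters outside the kept
 * categories with a single space.
 */
function sketch_strip_excluded(string $text): string {
    // The /u modifier makes PCRE treat the subject and pattern as UTF-8.
    $clean = preg_replace('/[^' . SKETCH_KEEP_CLASS . ']+/u', ' ', $text);
    return trim($clean);
}

echo sketch_strip_excluded('Hello, wörld! (v2)'), "\n"; // prints "Hello wörld v2"
```

Punctuation, symbols, and whitespace all fall outside the kept categories, so they collapse into word separators.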
| enum START_DELIM |
Definition at line 59 of file tokeniserlib.php.
| tokenise_expand_cjk ($matches) |
Basic CJK tokeniser. Simply splits a string into consecutive, overlapping sequences of characters (MINIMUM_WORD_SIZE long).
Definition at line 375 of file tokeniserlib.php.
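The overlapping split described above can be sketched as a sliding window over the characters of the string. This is a minimal illustration, not the library's code: `SKETCH_MIN_WORD_SIZE` and `sketch_overlapping_sequences()` are hypothetical names, the window size of 3 is an assumed value, and the real function receives a `$matches` array rather than a plain string.

```php
<?php
// Hypothetical stand-in for MINIMUM_WORD_SIZE; the value 3 is an assumption.
const SKETCH_MIN_WORD_SIZE = 3;

/**
 * Splits a UTF-8 string into consecutive, overlapping character sequences
 * of a fixed length.
 */
function sketch_overlapping_sequences(string $text, int $size = SKETCH_MIN_WORD_SIZE): array {
    // Split into individual Unicode characters (the /u modifier enables UTF-8 mode).
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);

    // A string no longer than the window is returned as a single sequence.
    if ($count <= $size) {
        return $count > 0 ? [$text] : [];
    }

    $sequences = [];
    for ($i = 0; $i + $size <= $count; $i++) {
        // Each window starts one character after the previous one.
        $sequences[] = implode('', array_slice($chars, $i, $size));
    }
    return $sequences;
}

print_r(sketch_overlapping_sequences('日本語のテキスト'));
```

Because consecutive windows share all but one character, a search for any substring of at least the window length can match an indexed sequence even though the text has no word separators.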
| tokenise_simplify ($text, $overlap_cjk, $join_numbers) |
Simplifies a string according to indexing rules.
Definition at line 317 of file tokeniserlib.php.
| tokenise_split ($text, $stop_words, $overlap_cjk, $join_numbers) |
Splits a string into tokens. This is a helper function and should be considered private.
Definition at line 287 of file tokeniserlib.php.
| tokenise_text ($text, $stop_words = array(), $overlap_cjk = false, $join_numbers = false) |
Processes the input text, extracting all tokens and scoring each one based on its number of occurrences and its relation to some well-known HTML tags.
| string | $text | The text to be tokenised. |
| array | $stop_words | Array of UTF-8 words that can be ignored in the text being processed. Stop word lists are available at http://snowball.tartarus.org/ |
| boolean | $overlap_cjk | Whether to split CJK text into overlapping tokens so that it can be searched. Useful for building indexes and search systems. |
| boolean | $join_numbers | Whether to join sequences of numbers separated by punctuation characters into a single token. Useful for building indexes and search systems. |
Definition at line 162 of file tokeniserlib.php.
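The occurrence-counting part of this behaviour can be sketched in a few lines. This is a deliberately simplified illustration, not the library's implementation: `sketch_count_tokens()` is a hypothetical name, lowercasing is an assumption, and HTML tag weighting, CJK overlap, and number joining are all left out.

```php
<?php
/**
 * Hypothetical, simplified counterpart to tokenise_text(): counts token
 * occurrences only, skipping the caller's stop words.
 */
function sketch_count_tokens(string $text, array $stop_words = []): array {
    // Assumption: tokens are compared case-insensitively. strtolower() only
    // handles ASCII; real code would need a UTF-8 aware function.
    $text = strtolower($text);

    // Split on any run of characters that is neither a letter nor a number.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $scores = [];
    foreach ($words as $word) {
        if (in_array($word, $stop_words, true)) {
            continue; // Skip words the caller asked to ignore.
        }
        $scores[$word] = ($scores[$word] ?? 0) + 1;
    }

    arsort($scores); // Highest count first.
    return $scores;
}

print_r(sketch_count_tokens('The cat and the other cat', ['the', 'and']));
```

In this sketch the result maps each remaining token to its count, so `cat` scores 2 and `other` scores 1.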
| tokenise_truncate_word (&$text) |
Helper function for array_walk in search_index_split. Truncates one string (token) to MAXIMUM_WORD_SIZE. The token is passed by reference and modified in place.
Definition at line 406 of file tokeniserlib.php.
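The by-reference pattern used with `array_walk` can be sketched as below. The names `SKETCH_MAX_WORD_SIZE` and `sketch_truncate_word()` are hypothetical, the limit of 5 is chosen only to keep the example short, and truncating by Unicode character rather than by byte is an assumption about the intended behaviour.

```php
<?php
// Hypothetical limit for illustration; the library's MAXIMUM_WORD_SIZE value is not shown here.
const SKETCH_MAX_WORD_SIZE = 5;

/**
 * array_walk() callback: shortens the token in place.
 * The & makes $text a reference to the array element itself.
 */
function sketch_truncate_word(string &$text): void {
    // Split into Unicode characters so multibyte characters are never cut in half.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if (count($chars) > SKETCH_MAX_WORD_SIZE) {
        $text = implode('', array_slice($chars, 0, SKETCH_MAX_WORD_SIZE));
    }
}

$tokens = ['tokeniser', 'word', 'überlänge'];
array_walk($tokens, 'sketch_truncate_word');
print_r($tokens); // 'token', 'word', 'überl'
```

Since `array_walk` hands each element to the callback by reference, no return value is needed and the original array is updated directly.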