Moodle  2.2.1
http://www.collinsharper.com
C:/xampp/htdocs/moodle/lib/tokeniserlib.php File Reference

Namespaces

namespace  moodlecore

Enumerations

enum  MINIMUM_WORD_SIZE
 Minimum word size to index and search.
enum  MAXIMUM_WORD_SIZE
 Maximum word size to index and search.
enum  START_DELIM
enum  CENTER_DELIM
enum  END_DELIM
enum  PREG_CLASS_SEARCH_EXCLUDE
enum  PREG_CLASS_NUMBERS
enum  PREG_CLASS_PUNCTUATION
enum  PREG_CLASS_CJK

Functions

 tokenise_text ($text, $stop_words=array(), $overlap_cjk=false, $join_numbers=false)
 tokenise_split ($text, $stop_words, $overlap_cjk, $join_numbers)
 tokenise_simplify ($text, $overlap_cjk, $join_numbers)
 tokenise_expand_cjk ($matches)
 tokenise_truncate_word (&$text)

Enumeration Type Documentation

enum CENTER_DELIM

Definition at line 60 of file tokeniserlib.php.

enum END_DELIM

Definition at line 61 of file tokeniserlib.php.

enum MAXIMUM_WORD_SIZE

Maximum word size to index and search.

Definition at line 57 of file tokeniserlib.php.

enum MINIMUM_WORD_SIZE

Minimum word size to index and search.

Definition at line 56 of file tokeniserlib.php.

enum PREG_CLASS_CJK

Matches all CJK characters that are candidates for auto-splitting (Chinese, Japanese, Korean). Contains kana and BMP ideographs.

Definition at line 141 of file tokeniserlib.php.

enum PREG_CLASS_NUMBERS

Matches all 'N' Unicode character classes (numbers).

Definition at line 107 of file tokeniserlib.php.

enum PREG_CLASS_PUNCTUATION

Matches all 'P' Unicode character classes (punctuation).

Definition at line 120 of file tokeniserlib.php.

enum PREG_CLASS_SEARCH_EXCLUDE

Matches Unicode character classes to exclude from the search index.

See: http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values

The index only contains the following character classes:
    Lu  Letter, Uppercase
    Ll  Letter, Lowercase
    Lt  Letter, Titlecase
    Lo  Letter, Other
    Nd  Number, Decimal Digit
    No  Number, Other

Definition at line 76 of file tokeniserlib.php.

enum START_DELIM

Definition at line 59 of file tokeniserlib.php.
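
The Unicode character classes kept in the index (Lu, Ll, Lt, Lo, Nd, No) can be illustrated with a short sketch. The function name and the regular expression below are assumptions made for illustration; they are not the actual value of PREG_CLASS_SEARCH_EXCLUDE, which is defined in tokeniserlib.php.

```php
<?php
// Hypothetical helper, not part of tokeniserlib.php: keeps only characters in
// the Lu, Ll, Lt, Lo, Nd and No general categories and replaces every other
// run of characters with a single space.
function sketch_strip_unindexed($text) {
    // \p{..} Unicode properties need the /u modifier (UTF-8 mode).
    $cleaned = preg_replace('/[^\p{Lu}\p{Ll}\p{Lt}\p{Lo}\p{Nd}\p{No}]+/u', ' ', $text);
    return trim($cleaned);
}

echo sketch_strip_unindexed('Hello, wörld! 42'), "\n"; // prints "Hello wörld 42"
```

Punctuation and whitespace fall outside the six listed categories, so they act as separators in this sketch.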


Function Documentation

tokenise_expand_cjk ($matches)

Basic CJK tokeniser. Simply splits a string into consecutive, overlapping sequences of characters (MINIMUM_WORD_SIZE long).

Definition at line 375 of file tokeniserlib.php.
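
The overlapping split described for tokenise_expand_cjk can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, it takes a plain string rather than a preg match array, and the window size of 3 stands in for MINIMUM_WORD_SIZE, whose actual value is not given here.

```php
<?php
// Hypothetical stand-in for the overlapping CJK split; not the library code.
function sketch_overlap_sequences($text, $size) {
    // Split the UTF-8 string into individual characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);
    if ($count <= $size) {
        return array($text); // too short to overlap; keep as a single token
    }
    $tokens = array();
    // Slide a window of $size characters one position at a time.
    for ($i = 0; $i + $size <= $count; $i++) {
        $tokens[] = implode('', array_slice($chars, $i, $size));
    }
    return $tokens;
}

print_r(sketch_overlap_sequences('日本語入力', 3));
// yields 日本語, 本語入, 語入力
```

Because consecutive windows share characters, a search for any substring of the window length can match an indexed token.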

tokenise_simplify ($text, $overlap_cjk, $join_numbers)

Simplifies a string according to indexing rules.

Definition at line 317 of file tokeniserlib.php.

tokenise_split ($text, $stop_words, $overlap_cjk, $join_numbers)

Splits a string into tokens. This is a helper function and should be considered private.

Definition at line 287 of file tokeniserlib.php.

tokenise_text ($text, $stop_words=array(), $overlap_cjk=false, $join_numbers=false)

This function processes the input text, extracting all the tokens and scoring each one based on its number of occurrences and its relation to some well-known HTML tags.

Parameters:
    string   $text          the text to be tokenised.
    array    $stop_words    array of UTF-8 words that can be ignored in the text being processed. Stop-word lists are available at http://snowball.tartarus.org/
    boolean  $overlap_cjk   option to split CJK text into overlapping tokens in order to allow them to be searched. Useful to build indexes and search systems.
    boolean  $join_numbers  option to join sequences of numbers separated by punctuation characters into a single token. Useful to build indexes and search systems.
Returns:
    array  a sorted array of tokens, with the tokens as keys and their scores as values.

Definition at line 162 of file tokeniserlib.php.
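
The shape of the returned array (tokens as keys, scores as values) can be illustrated with a simplified sketch. It scores by occurrence count only; the HTML tag weighting, CJK overlap and number joining are omitted. The function name is hypothetical, the descending sort order is an assumption, and mb_strtolower requires the mbstring extension.

```php
<?php
// Hypothetical, simplified sketch; not the tokenise_text implementation.
function sketch_score_tokens($text, $stop_words = array()) {
    $lower = mb_strtolower($text, 'UTF-8');
    // Split on anything that is not a letter or a number.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
    $scores = array();
    foreach ($words as $word) {
        if (in_array($word, $stop_words, true)) {
            continue; // stop words are ignored
        }
        $scores[$word] = isset($scores[$word]) ? $scores[$word] + 1 : 1;
    }
    arsort($scores); // assumed order: highest score first, keys preserved
    return $scores;
}

print_r(sketch_score_tokens('The quiz and the forum quiz', array('the', 'and')));
// yields quiz => 2, forum => 1
```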

tokenise_truncate_word (&$text)

Helper function for array_walk in search_index_split. Truncates one string (token) to MAXIMUM_WORD_SIZE.

Definition at line 406 of file tokeniserlib.php.
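
The by-reference parameter is what lets array_walk modify each token in place. A minimal sketch of that pattern follows; the function name is hypothetical, and the limit of 50 is an assumed value used in place of MAXIMUM_WORD_SIZE.

```php
<?php
// Hypothetical equivalent of the truncation callback; not the library code.
// array_walk passes the element by reference, then its key, then the extra argument.
function sketch_truncate_word(&$text, $key, $max) {
    if (mb_strlen($text, 'UTF-8') > $max) {
        $text = mb_substr($text, 0, $max, 'UTF-8');
    }
}

$tokens = array('short', str_repeat('x', 60));
array_walk($tokens, 'sketch_truncate_word', 50); // 50 is an assumed limit
echo strlen($tokens[1]), "\n"; // prints 50
```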
