Namespaces
namespace	moodlecore
Enumerations
enum	MINIMUM_WORD_SIZE
	Minimum word size to index and search.
enum	MAXIMUM_WORD_SIZE
	Maximum word size to index and search.
enum	START_DELIM
enum	CENTER_DELIM
enum	END_DELIM
enum	PREG_CLASS_SEARCH_EXCLUDE
enum	PREG_CLASS_NUMBERS
enum	PREG_CLASS_PUNCTUATION
enum	PREG_CLASS_CJK
Functions
	tokenise_text ($text, $stop_words=array(), $overlap_cjk=false, $join_numbers=false)
	tokenise_split ($text, $stop_words, $overlap_cjk, $join_numbers)
	tokenise_simplify ($text, $overlap_cjk, $join_numbers)
	tokenise_expand_cjk ($matches)
	tokenise_truncate_word (&$text)

Enumeration Type Documentation

enum CENTER_DELIM

Definition at line 60 of file tokeniserlib.php.

enum END_DELIM

Definition at line 61 of file tokeniserlib.php.

enum MAXIMUM_WORD_SIZE

Maximum word size to index and search.

Definition at line 57 of file tokeniserlib.php.

enum MINIMUM_WORD_SIZE

Minimum word size to index and search.

Definition at line 56 of file tokeniserlib.php.

enum PREG_CLASS_CJK

Matches all CJK characters that are candidates for auto-splitting (Chinese, Japanese, Korean). Contains kana and BMP ideographs.

Definition at line 141 of file tokeniserlib.php.

enum PREG_CLASS_NUMBERS

Matches all 'N' Unicode character classes (numbers)

Definition at line 107 of file tokeniserlib.php.

enum PREG_CLASS_PUNCTUATION

Matches all 'P' Unicode character classes (punctuation)

Definition at line 120 of file tokeniserlib.php.
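
As an illustration of how a number class and a punctuation class can work together (for example for the $join_numbers option of tokenise_text), here is a minimal sketch. It assumes the PCRE Unicode properties \p{N} and \p{P} as stand-ins for PREG_CLASS_NUMBERS and PREG_CLASS_PUNCTUATION; the helper name sketch_join_numbers is hypothetical and is not part of tokeniserlib.php.

```php
<?php
// Hypothetical helper, NOT part of tokeniserlib.php.
// \p{N} and \p{P} stand in for PREG_CLASS_NUMBERS and PREG_CLASS_PUNCTUATION.
function sketch_join_numbers($text) {
    // Drop punctuation that sits directly between two number characters,
    // so "2007-10-05" becomes the single token "20071005".
    return preg_replace('/(?<=\p{N})\p{P}+(?=\p{N})/u', '', $text);
}

echo sketch_join_numbers("Released 2007-10-05, v1.9.3"), "\n";
// prints "Released 20071005, v193"
```

Punctuation that is not surrounded by numbers on both sides (such as the comma above) is left untouched.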

enum PREG_CLASS_SEARCH_EXCLUDE

Matches Unicode character classes to exclude from the search index.

See: http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values

The index only contains the following character classes: Lu (Letter, Uppercase), Ll (Letter, Lowercase), Lt (Letter, Titlecase), Lo (Letter, Other), Nd (Number, Decimal Digit), No (Number, Other).

Definition at line 76 of file tokeniserlib.php.
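
The effect of an exclusion class like this one can be sketched with PCRE Unicode properties. This is only an approximation under the assumption that the six general categories listed above are the ones kept; the actual constant may use explicit character ranges, and the helper name sketch_strip_excluded is hypothetical.

```php
<?php
// Hypothetical helper, NOT part of tokeniserlib.php.
// Keeps only the categories Lu, Ll, Lt, Lo, Nd and No.
function sketch_strip_excluded($text) {
    // Replace every run of characters outside the kept categories with one space.
    $kept = preg_replace('/[^\p{Lu}\p{Ll}\p{Lt}\p{Lo}\p{Nd}\p{No}]+/u', ' ', $text);
    return trim($kept);
}

echo sketch_strip_excluded("Hello, world! 42 (test)"), "\n";
// prints "Hello world 42 test"
```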

enum START_DELIM

Definition at line 59 of file tokeniserlib.php.

Function Documentation

tokenise_expand_cjk ($matches)

Basic CJK tokeniser. Simply splits a string into consecutive, overlapping sequences of characters (MINIMUM_WORD_SIZE long).

Definition at line 375 of file tokeniserlib.php.
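
The overlapping split described above can be sketched as follows. This is a simplified illustration, not the library's code: the function name sketch_expand_cjk, the $size parameter (standing in for MINIMUM_WORD_SIZE, whose value is not given here) and the Han-only character range in the usage line are all assumptions.

```php
<?php
// Hypothetical helper, NOT part of tokeniserlib.php.
// $matches follows the preg_replace_callback convention: $matches[0] is the matched run.
// $size stands in for MINIMUM_WORD_SIZE (assumed value for illustration).
function sketch_expand_cjk($matches, $size = 2) {
    // Split the UTF-8 string into individual characters.
    $chars = preg_split('//u', $matches[0], -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);
    // Runs no longer than $size pass through unchanged.
    if ($count <= $size) {
        return ' ' . $matches[0] . ' ';
    }
    // Slide a window of $size characters over the run, one character at a time.
    $tokens = array();
    for ($i = 0; $i + $size <= $count; $i++) {
        $tokens[] = implode('', array_slice($chars, $i, $size));
    }
    return ' ' . implode(' ', $tokens) . ' ';
}

// Usage: expand every run of Han ideographs (an assumed, narrower range than PREG_CLASS_CJK).
$out = preg_replace_callback('/[\x{4e00}-\x{9fff}]+/u', 'sketch_expand_cjk', 'abc 東京都庁');
echo trim($out), "\n";
```

Overlapping windows let a search for any adjacent pair of characters match the indexed text, at the cost of a larger index.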

tokenise_simplify ($text, $overlap_cjk, $join_numbers)

Simplifies a string according to indexing rules.

Definition at line 317 of file tokeniserlib.php.

tokenise_split ($text, $stop_words, $overlap_cjk, $join_numbers)

Splits a string into tokens. This is a helper function and should be considered private.

Definition at line 287 of file tokeniserlib.php.

tokenise_text ($text, $stop_words=array(), $overlap_cjk=false, $join_numbers=false)

Processes the input text, extracting all tokens and scoring each one based on its number of occurrences and its relation to some well-known HTML tags.

Parameters:

string	$text	the text to be tokenised.
array	$stop_words	array of UTF-8 words that can be ignored in the text being processed. Lists of stop words are available at http://snowball.tartarus.org/
boolean	$overlap_cjk	option to split CJK text into overlapping tokens in order to allow them to be searched. Useful for building indexes and search systems.
boolean	$join_numbers	option to join sequences of numbers separated by punctuation characters into a single token. Useful for building indexes and search systems.

Returns:

array	a sorted array of tokens, with the tokens as keys and their scores as values.

Definition at line 162 of file tokeniserlib.php.
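
To make the shape of the return value concrete, here is a heavily simplified stand-in. It is not the tokeniserlib.php implementation: it scores tokens by occurrence count only, ignores the HTML-tag weighting and the CJK and number options, and assumes that "sorted" means highest score first. The name sketch_score_tokens is hypothetical.

```php
<?php
// Hypothetical stand-in, NOT the tokenise_text implementation.
function sketch_score_tokens($text, $stop_words = array()) {
    // Split on anything that is not a letter or a number.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $scores = array();
    foreach ($words as $word) {
        // Skip stop words entirely.
        if (in_array($word, $stop_words, true)) {
            continue;
        }
        // Score is the plain occurrence count (assumption: no HTML weighting).
        $scores[$word] = isset($scores[$word]) ? $scores[$word] + 1 : 1;
    }
    // Assumed ordering: highest score first, keys preserved.
    arsort($scores);
    return $scores;
}

print_r(sketch_score_tokens('the cat and the hat and the cat', array('the', 'and')));
```

The result maps each token to its score, which mirrors the documented "tokens as keys, scores as values" layout.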

tokenise_truncate_word (&$text)

Helper function for array_walk in search_index_split. Truncates a single string (token) to MAXIMUM_WORD_SIZE.

Definition at line 406 of file tokeniserlib.php.
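
The by-reference signature is what lets array_walk modify each token in place. A minimal sketch of that pattern follows; the callback name sketch_truncate_word and the explicit $max argument (standing in for MAXIMUM_WORD_SIZE, whose value is not given here) are assumptions for illustration.

```php
<?php
// Hypothetical callback, NOT part of tokeniserlib.php.
// array_walk passes the value (by reference here), the key, and one optional extra argument.
function sketch_truncate_word(&$text, $key, $max) {
    // Count characters, not bytes, so multi-byte UTF-8 text is cut cleanly.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if (count($chars) > $max) {
        $text = implode('', array_slice($chars, 0, $max));
    }
}

$tokens = array('short', 'averyveryverylongtoken');
array_walk($tokens, 'sketch_truncate_word', 10); // 10 is an assumed limit
print_r($tokens);
// the second token is now "averyveryv"
```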
