.\" Automatically generated by Pod::Man 2.27 (Pod::Simple 3.28)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{
.    if \nF \{
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "DBD::SQLite::Fulltext_search 3"
.TH DBD::SQLite::Fulltext_search 3 "2019-05-22" "perl v5.16.3" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
DBD::SQLite::Fulltext_search \- Using fulltext searches with DBD::SQLite
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
.SS "Introduction"
.IX Subsection "Introduction"
SQLite is bundled with an extension module called \*(L"\s-1FTS\*(R"\s0 for full-text
indexing. Tables with this feature enabled can be efficiently queried
to find rows that contain one or more instances of some specified
words (also called \*(L"tokens\*(R"), in any column, even if the table contains many
large documents.
.PP
The first full-text search modules for SQLite were called \f(CW\*(C`FTS1\*(C'\fR and \f(CW\*(C`FTS2\*(C'\fR
and are now obsolete. The latest version is \f(CW\*(C`FTS4\*(C'\fR, but it shares many
features with the former module \f(CW\*(C`FTS3\*(C'\fR, which is why parts of the 
\&\s-1API\s0 and parts of the documentation still refer to \f(CW\*(C`FTS3\*(C'\fR; from a client
point of view, both can be considered largely equivalent.
Detailed documentation can be found
at <http://www.sqlite.org/fts3.html>.
.SS "Short example"
.IX Subsection "Short example"
Here is a very short example of using \s-1FTS\s0:
.PP
.Vb 2
\&  $dbh\->do(<<"") or die $dbh\->errstr;
\&  CREATE VIRTUAL TABLE fts_example USING fts4(content)
\&  
\&  my $sth = $dbh\->prepare("INSERT INTO fts_example(content) VALUES (?)");
\&  $sth\->execute($_) foreach @docs_to_insert;
\&  
\&  my $results = $dbh\->selectall_arrayref(<<"");
\&  SELECT docid, snippet(fts_example) FROM fts_example WHERE content MATCH \*(Aqfoo\*(Aq
.Ve
.PP
The key points in this example are:
.IP "\(bu" 4
The syntax for creating \s-1FTS\s0 tables is
.Sp
.Vb 1
\&  CREATE VIRTUAL TABLE <table_name> USING fts4(<columns>)
.Ve
.Sp
where \f(CW\*(C`<columns>\*(C'\fR is a list of column names. Columns may be
typed, but the type information is ignored. If no columns
are specified, the default is a single column named \f(CW\*(C`content\*(C'\fR.
In addition, \s-1FTS\s0 tables have an implicit column called \f(CW\*(C`docid\*(C'\fR
(also accessible as \f(CW\*(C`rowid\*(C'\fR) for numbering the stored documents.
.IP "\(bu" 4
Statements for inserting, updating or deleting records 
use the same syntax as for regular SQLite tables.
.IP "\(bu" 4
Full-text searches are specified with the \f(CW\*(C`MATCH\*(C'\fR operator, and an
operand which may be a single word, a word prefix ending with '*', a
list of words, a \*(L"phrase query\*(R" in double quotes, or a boolean combination
of the above.
.IP "\(bu" 4
The builtin function \f(CW\*(C`snippet(...)\*(C'\fR builds a formatted excerpt of the
document text, where the words pertaining to the query are highlighted.
.PP
There are many more details to building and searching
\&\s-1FTS\s0 tables, so we strongly invite you to read
the full documentation at <http://www.sqlite.org/fts3.html>.
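.PP
As a minimal, self-contained sketch of the same steps, the script below
runs against an in-memory database. The sample documents and the
\&\f(CW\*(C`fox*\*(C'\fR query are illustrative assumptions, and it presumes a
\&\f(CW\*(C`DBD::SQLite\*(C'\fR build with \s-1FTS4\s0 enabled:
.PP
.Vb 4
\&  use strict;
\&  use warnings;
\&  use DBI;
\&
\&  # In\-memory database; RaiseError makes DBI die on any failure.
\&  my $dbh = DBI\->connect("dbi:SQLite:dbname=:memory:", "", "",
\&                         { RaiseError => 1 });
\&
\&  $dbh\->do("CREATE VIRTUAL TABLE fts_example USING fts4(content)");
\&
\&  my $sth = $dbh\->prepare("INSERT INTO fts_example(content) VALUES (?)");
\&  $sth\->execute($_) for "the quick brown fox", "a lazy dog", "foxes and dogs";
\&
\&  # Bind the query expression as a placeholder rather than interpolating it.
\&  my $rows = $dbh\->selectall_arrayref(
\&    "SELECT docid, snippet(fts_example) FROM fts_example WHERE content MATCH ?",
\&    undef, "fox*");
\&
\&  printf "%d: %s\en", @$_ for @$rows;
.Ve
.PP
Binding the \s-1MATCH\s0 operand keeps quotes in user input from breaking the
\&\s-1SQL\s0 statement; the query expression itself is still parsed by \s-1FTS\s0.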
.SH "QUERY SYNTAX"
.IX Header "QUERY SYNTAX"
Here are some explanations of \s-1FTS\s0 queries, borrowed from
the SQLite documentation.
.SS "Token or token prefix queries"
.IX Subsection "Token or token prefix queries"
An \s-1FTS\s0 table may be queried for all documents that contain a specified
term, or for all documents that contain a term with a specified
prefix. The query expression for a specific term is simply the term
itself. The query expression used to search for a term prefix is the
prefix itself with a '*' character appended to it. For example:
.PP
.Vb 2
\&  \-\- Virtual table declaration
\&  CREATE VIRTUAL TABLE docs USING fts3(title, body);
\&  
\&  \-\- Query for all documents containing the term "linux":
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqlinux\*(Aq;
\&  
\&  \-\- Query for all documents containing a term with the prefix "lin".
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqlin*\*(Aq;
.Ve
.PP
If a search token (on the right-hand side of the \s-1MATCH\s0 operator) 
begins with \*(L"^\*(R" then that token must be the first in its field of
the document; for example, \f(CW\*(C`^lin*\*(C'\fR matches
\&'linux kernel changes ...' but does not match 'new linux implementation'.
.SS "Column specifications"
.IX Subsection "Column specifications"
Normally, a token or token prefix query is matched against the \s-1FTS\s0
table column specified as the left-hand side of the \s-1MATCH\s0
operator. Or, if the special column with the same name as the \s-1FTS\s0
table itself is specified, against all columns. This may be overridden
by specifying a column-name followed by a \*(L":\*(R" character before a basic
term query. There may be space between the \*(L":\*(R" and the term to query
for, but not between the column-name and the \*(L":\*(R" character. For
example:
.PP
.Vb 4
\&  \-\- Query the database for documents for which the term "linux" appears in
\&  \-\- the document title, and the term "problems" appears in either the title
\&  \-\- or body of the document.
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqtitle:linux problems\*(Aq;
\&
\&  \-\- Query the database for documents for which the term "linux" appears in
\&  \-\- the document title, and the term "driver" appears in the body of the document
\&  \-\- ("driver" may also appear in the title, but this alone will not satisfy the
\&  \-\- query criteria).
\&  SELECT * FROM docs WHERE body MATCH \*(Aqtitle:linux driver\*(Aq;
.Ve
.SS "Phrase queries"
.IX Subsection "Phrase queries"
A phrase query is a query that retrieves all documents that contain a
nominated set of terms or term prefixes in a specified order with no
intervening tokens. Phrase queries are specified by enclosing a space
separated sequence of terms or term prefixes in double quotes ("). For
example:
.PP
.Vb 2
\&  \-\- Query for all documents that contain the phrase "linux applications".
\&  SELECT * FROM docs WHERE docs MATCH \*(Aq"linux applications"\*(Aq;
\&
\&  \-\- Query for all documents that contain a phrase that matches "lin* app*". 
\&  \-\- As well as "linux applications", this will match common phrases such 
\&  \-\- as "linoleum appliances" or "link apprentice".
\&  SELECT * FROM docs WHERE docs MATCH \*(Aq"lin* app*"\*(Aq;
.Ve
.SS "\s-1NEAR\s0 queries"
.IX Subsection "NEAR queries"
A \s-1NEAR\s0 query is a query that returns documents that contain two or
more nominated terms or phrases within a specified proximity of each
other (by default with 10 or fewer intervening terms). A \s-1NEAR\s0 query is
specified by putting the keyword \*(L"\s-1NEAR\*(R"\s0 between two phrase, term or
prefix queries. To specify a proximity other than the default, an
operator of the form \*(L"NEAR/<N>\*(R" may be used, where <N> is the maximum
number of intervening terms allowed. For example:
.PP
.Vb 2
\&  \-\- Virtual table declaration.
\&  CREATE VIRTUAL TABLE docs USING fts4();
\&
\&  \-\- Virtual table data.
\&  INSERT INTO docs VALUES(\*(AqSQLite is an ACID compliant embedded relational database management system\*(Aq);
\&
\&  \-\- Search for a document that contains the terms "sqlite" and "database" with
\&  \-\- not more than 10 intervening terms. This matches the only document in
\&  \-\- table docs (since there are only six terms between "SQLite" and "database" 
\&  \-\- in the document).
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqsqlite NEAR database\*(Aq;
\&
\&  \-\- Search for a document that contains the terms "sqlite" and "database" with
\&  \-\- not more than 6 intervening terms. This also matches the only document in
\&  \-\- table docs. Note that the order in which the terms appear in the document
\&  \-\- does not have to be the same as the order in which they appear in the query.
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqdatabase NEAR/6 sqlite\*(Aq;
\&
\&  \-\- Search for a document that contains the terms "sqlite" and "database" with
\&  \-\- not more than 5 intervening terms. This query matches no documents.
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqdatabase NEAR/5 sqlite\*(Aq;
\&
\&  \-\- Search for a document that contains the phrase "ACID compliant" and the term
\&  \-\- "database" with not more than 2 terms separating the two. This matches the
\&  \-\- document stored in table docs.
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqdatabase NEAR/2 "ACID compliant"\*(Aq;
\&
\&  \-\- Search for a document that contains the phrase "ACID compliant" and the term
\&  \-\- "sqlite" with not more than 2 terms separating the two. This also matches
\&  \-\- the only document stored in table docs.
\&  SELECT * FROM docs WHERE docs MATCH \*(Aq"ACID compliant" NEAR/2 sqlite\*(Aq;
.Ve
.PP
More than one \s-1NEAR\s0 operator may appear in a single query. In this case
each pair of terms or phrases separated by a \s-1NEAR\s0 operator must appear
within the specified proximity of each other in the document. Using
the same table and data as in the block of examples above:
.PP
.Vb 5
\&  \-\- The following query selects documents that contain an instance of the term
\&  \-\- "sqlite" separated by two or fewer terms from an instance of the term "acid",
\&  \-\- which is in turn separated by two or fewer terms from an instance of the term
\&  \-\- "relational".
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqsqlite NEAR/2 acid NEAR/2 relational\*(Aq;
\&
\&  \-\- This query matches no documents. There is an instance of the term "sqlite" with
\&  \-\- sufficient proximity to an instance of "acid" but it is not sufficiently close
\&  \-\- to an instance of the term "relational".
\&  SELECT * FROM docs WHERE docs MATCH \*(Aqacid NEAR/2 sqlite NEAR/2 relational\*(Aq;
.Ve
.PP
Phrase and \s-1NEAR\s0 queries may not span multiple columns within a row.
.SS "Set operations"
.IX Subsection "Set operations"
The three basic query types described above may be used to query the
full-text index for the set of documents that match the specified
criteria. Using the \s-1FTS\s0 query expression language it is possible to
perform various set operations on the results of basic queries. There
are currently three supported operations:
.IP "\(bu" 4
The \s-1AND\s0 operator determines the intersection of two sets of documents.
.IP "\(bu" 4
The \s-1OR\s0 operator calculates the union of two sets of documents.
.IP "\(bu" 4
The \s-1NOT\s0 operator may be used to compute the relative complement of one
set of documents with respect to another.
.PP
The \s-1AND, OR\s0 and \s-1NOT\s0 binary set operators must be entered using capital
letters; otherwise, they are interpreted as basic term queries instead
of set operators.  Each of the two operands to an operator may be a
basic \s-1FTS\s0 query, or the result of another \s-1AND, OR\s0 or \s-1NOT\s0 set
operation. Parentheses may be used to control precedence and grouping.
.PP
The \s-1AND\s0 operator is implicit for adjacent basic queries without any
explicit operator. For example, the query expression \*(L"implicit
operator\*(R" is a more succinct version of \*(L"implicit \s-1AND\s0 operator\*(R".
.PP
Boolean operations as just described correspond to the so-called
\&\*(L"enhanced query syntax\*(R" of SQLite; this is the version compiled
with \f(CW\*(C`DBD::SQLite\*(C'\fR, starting from version 1.31.
A former version, called the \*(L"standard query syntax\*(R", used to
support tokens prefixed with '+' or '\-' signs (for token inclusion
or exclusion); if your application needs to support this old
syntax, use DBD::SQLite::FTS3Transitional (published
in a separate distribution) for doing the conversion.
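.PP
The set operators can be tried out from Perl with a sketch like the
following. It assumes a connected \f(CW$dbh\fR with \f(CW\*(C`RaiseError\*(C'\fR set and
an \s-1FTS\s0 table named \f(CW\*(C`docs\*(C'\fR such as the ones declared above; the
search terms and hash keys are illustrative only:
.PP
.Vb 4
\&  my %query = (
\&    both    => "linux AND driver",
\&    either  => "linux OR driver",
\&    without => "linux NOT driver",
\&    grouped => "(linux OR bsd) AND driver",
\&  );
\&
\&  my $sth = $dbh\->prepare("SELECT docid FROM docs WHERE docs MATCH ?");
\&  for my $name (sort keys %query) {
\&    # selectcol_arrayref accepts an already prepared statement handle.
\&    my $ids = $dbh\->selectcol_arrayref($sth, undef, $query{$name});
\&    print "$name: @$ids\en";
\&  }
.Ve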
.SH "TOKENIZERS"
.IX Header "TOKENIZERS"
.SS "Concept"
.IX Subsection "Concept"
The behaviour of full-text indexes strongly depends on how
documents are split into \fItokens\fR; therefore \s-1FTS\s0 table
declarations can explicitly specify how to perform
tokenization:
.PP
.Vb 1
\&  CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)
.Ve
.PP
where \f(CW\*(C`<tokenizer>\*(C'\fR is a sequence of space-separated
words that triggers a specific tokenizer. Tokenizers can
be either SQLite builtins, written in C, or Perl tokenizers.
Both kinds are explained below.
.SS "SQLite builtin tokenizers"
.IX Subsection "SQLite builtin tokenizers"
SQLite comes with some builtin tokenizers (see
<http://www.sqlite.org/fts3.html#tokenizer>):
.IP "simple" 4
.IX Item "simple"
Under the \fIsimple\fR tokenizer, a term is a contiguous sequence of
eligible characters, where eligible characters are all alphanumeric
characters, the \*(L"_\*(R" character, and all characters with \s-1UTF\s0 codepoints
greater than or equal to 128. All other characters are discarded when
splitting a document into terms. They serve only to separate adjacent
terms.
.Sp
All uppercase characters within the \s-1ASCII\s0 range (\s-1UTF\s0 codepoints less
than 128), are transformed to their lowercase equivalents as part of
the tokenization process. Thus, full-text queries are case-insensitive
when using the simple tokenizer.
.IP "porter" 4
.IX Item "porter"
The \fIporter\fR tokenizer uses the same rules to separate the input
document into terms, but as well as folding all terms to lower case it
uses the Porter Stemming algorithm to reduce related English language
words to a common root.
.IP "icu" 4
.IX Item "icu"
The \fIicu\fR tokenizer uses the \s-1ICU\s0 library to decide how to
identify word characters in different languages; however, this
requires SQLite to be compiled with the \f(CW\*(C`SQLITE_ENABLE_ICU\*(C'\fR
pre-processor symbol defined. So, to use this tokenizer, you need to
edit \fIMakefile.PL\fR to add this flag in \f(CW@CC_DEFINE\fR, and then
recompile \f(CW\*(C`DBD::SQLite\*(C'\fR; of course, the prerequisite is to have
an \s-1ICU\s0 library available on your system.
.IP "unicode61" 4
.IX Item "unicode61"
The \fIunicode61\fR tokenizer works very much like \*(L"simple\*(R" except that it
does full unicode case folding according to rules in Unicode Version
6.1 and it recognizes unicode space and punctuation characters and
uses those to separate tokens. By contrast, the simple tokenizer only
does case folding of \s-1ASCII\s0 characters and only recognizes \s-1ASCII\s0 space
and punctuation characters as token separators.
.Sp
By default, \*(L"unicode61\*(R" also removes all diacritics from Latin script
characters. This behaviour can be overridden by adding the tokenizer
argument \f(CW"remove_diacritics=0"\fR. For example:
.Sp
.Vb 4
\&  \-\- Create tables that remove diacritics from Latin script characters
\&  \-\- as part of tokenization.
\&  CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
\&  CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");
\&
\&  \-\- Create a table that does not remove diacritics from Latin script
\&  \-\- characters as part of tokenization.
\&  CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
.Ve
.Sp
Additional options can customize the set of codepoints that unicode61
treats as separator characters or as token characters \*(-- see the
documentation in <http://www.sqlite.org/fts3.html#unicode61>.
.PP
If a more complex tokenizing algorithm is required, for example to
implement stemming, discard punctuation, or to recognize compound words,
use the perl tokenizer to implement your own logic, as explained below.
.SS "Perl tokenizers"
.IX Subsection "Perl tokenizers"
\fIDeclaring a perl tokenizer\fR
.IX Subsection "Declaring a perl tokenizer"
.PP
In addition to the builtin SQLite tokenizers, \f(CW\*(C`DBD::SQLite\*(C'\fR
implements a \fIperl\fR tokenizer that can hook into any tokenizing
algorithm written in Perl. This is specified as follows:
.PP
.Vb 1
\&  CREATE ... USING fts4(<columns>, tokenize=perl \*(Aq<perl_function>\*(Aq)
.Ve
.PP
where \f(CW\*(C`<perl_function>\*(C'\fR is a fully qualified Perl function name
(i.e. prefixed by the name of the package in which that function is
declared). So for example if the function is \f(CW\*(C`my_func\*(C'\fR in the main 
program, write
.PP
.Vb 1
\&  CREATE ... USING fts4(<columns>, tokenize=perl \*(Aqmain::my_func\*(Aq)
.Ve
.PP
\fIWriting a perl tokenizer by hand\fR
.IX Subsection "Writing a perl tokenizer by hand"
.PP
That function should return a code reference that takes a string as
its single argument and returns an iterator (another function), which
returns a tuple \f(CW\*(C`($term, $len, $start, $end, $index)\*(C'\fR for each
term. Here is a simple example that tokenizes on words according to
the current perl locale:
.PP
.Vb 3
\&  sub locale_tokenizer {
\&    return sub {
\&      my $string = shift;
\&
\&      use locale;
\&      my $regex      = qr/\ew+/;
\&      my $term_index = 0;
\&
\&      return sub { # closure
\&        $string =~ /$regex/g or return; # either match, or no more token
\&        my ($start, $end) = ($\-[0], $+[0]);
\&        my $len           = $end\-$start;
\&        my $term          = substr($string, $start, $len);
\&        return ($term, $len, $start, $end, $term_index++);
\&      }
\&    };
\&  }
.Ve
.PP
There must be three levels of subs, in a kind of \*(L"Russian dolls\*(R" structure,
because:
.IP "\(bu" 4
the external, named sub is called whenever accessing an \s-1FTS\s0 table
with that tokenizer
.IP "\(bu" 4
the inner, anonymous sub is called whenever a new string
needs to be tokenized (either for inserting new text into the table,
or for analyzing a query).
.IP "\(bu" 4
the innermost, anonymous sub is called repeatedly for retrieving
all terms within that string.
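.PP
To put the three levels together, here is a sketch of how the
\&\f(CW\*(C`locale_tokenizer\*(C'\fR above might be attached to a table. It assumes the
function is defined in package \f(CW\*(C`main\*(C'\fR and that \f(CW$dbh\fR is a connected
handle with \f(CW\*(C`RaiseError\*(C'\fR set; the table name \f(CW\*(C`notes\*(C'\fR is hypothetical:
.PP
.Vb 4
\&  $dbh\->do(q{
\&    CREATE VIRTUAL TABLE notes
\&    USING fts4(content, tokenize=perl \*(Aqmain::locale_tokenizer\*(Aq)
\&  });
\&
\&  $dbh\->do("INSERT INTO notes(content) VALUES (?)",
\&           undef, "Hello, full\-text world");
\&
\&  # The query string goes through the same tokenizer as the stored text.
\&  my $ids = $dbh\->selectcol_arrayref(
\&    "SELECT docid FROM notes WHERE content MATCH ?", undef, "world");
.Ve
.PP
Because this tokenizer does not fold case, a query for \f(CW\*(C`hello\*(C'\fR would
not match the stored \f(CW\*(C`Hello\*(C'\fR; lowercase the term inside the innermost
sub if case-insensitive matching is wanted.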
.PP
\fIUsing Search::Tokenizer\fR
.IX Subsection "Using Search::Tokenizer"
.PP
Instead of writing tokenizers by hand, you can grab one of those
already implemented in the Search::Tokenizer module. For example,
if you want to ignore differences between accented characters, you can
write:
.PP
.Vb 4
\&  use Search::Tokenizer;
\&  $dbh\->do(<<"") or die $dbh\->errstr;
\&  CREATE ... USING fts4(<columns>, 
\&                        tokenize=perl \*(AqSearch::Tokenizer::unaccent\*(Aq)
.Ve
.PP
Alternatively, you can use \*(L"new\*(R" in Search::Tokenizer to build
your own tokenizer. Here is an example that treats compound
words (words with an internal dash, dot or slash) as single tokens:
.PP
.Vb 5
\&  sub my_tokenizer {
\&    return Search::Tokenizer\->new(
\&      regex => qr{\ep{Word}+(?:[\-./]\ep{Word}+)*},
\&    );
\&  }
.Ve
.SH "Fts4aux \- Direct Access to the Full-Text Index"
.IX Header "Fts4aux - Direct Access to the Full-Text Index"
The content of a full-text index can be accessed through the
virtual table module \*(L"fts4aux\*(R". For example, assuming that
our database contains a full-text indexed table named \*(L"ft\*(R",
we can declare:
.PP
.Vb 1
\&  CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)
.Ve
.PP
and then query the \f(CW\*(C`ft_terms\*(C'\fR table to access the
list of terms, their frequency, etc.
Examples are documented in
<http://www.sqlite.org/fts3.html#fts4aux>.
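.PP
For instance, the most frequent terms could be listed with a sketch such
as the one below. It assumes a connected \f(CW$dbh\fR with \f(CW\*(C`RaiseError\*(C'\fR
set and an existing \s-1FTS4\s0 table named \f(CW\*(C`ft\*(C'\fR; the limit of 10 is arbitrary:
.PP
.Vb 4
\&  $dbh\->do("CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)");
\&
\&  # Rows with col = \*(Aq*\*(Aq aggregate each term over all columns of ft.
\&  my $top = $dbh\->selectall_arrayref(q{
\&    SELECT term, documents, occurrences
\&      FROM ft_terms
\&     WHERE col = \*(Aq*\*(Aq
\&     ORDER BY occurrences DESC
\&     LIMIT 10
\&  });
\&
\&  printf "%\-20s %6d docs %6d hits\en", @$_ for @$top;
.Ve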
.SH "How to save database space"
.IX Header "How to save database space"
By default, \s-1FTS\s0 stores a complete copy of the indexed documents,
together with the fulltext index. On a large collection of documents,
this can consume quite a lot of disk space. However, \s-1FTS\s0 has some
options for compressing the documents, or even for not storing them at
all \*(-- see <http://www.sqlite.org/fts3.html#fts4_options>.
.PP
In particular, the option for \fIcontentless \s-1FTS\s0 tables\fR only stores
the fulltext index, without the original document content. This is
specified as \f(CW\*(C`content=""\*(C'\fR, as in the following example:
.PP
.Vb 1
\&  CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b)
.Ve
.PP
Data can be inserted into such an \s-1FTS4\s0 table using \s-1INSERT\s0
statements. However, unlike ordinary \s-1FTS4\s0 tables, the user must supply
an explicit integer docid value. For example:
.PP
.Vb 2
\&  \-\- This statement is Ok:
\&  INSERT INTO t1(docid, a, b) VALUES(1, \*(Aqa b c\*(Aq, \*(Aqd e f\*(Aq);
\&
\&  \-\- This statement causes an error, as no docid value has been provided:
\&  INSERT INTO t1(a, b) VALUES(\*(Aqj k l\*(Aq, \*(Aqm n o\*(Aq);
.Ve
.PP
Of course your application will need an algorithm for finding
the external resource corresponding to any \fIdocid\fR stored within
SQLite.
.PP
When using placeholders, the docid must be explicitly typed to
\&\s-1INTEGER,\s0 because this is a \*(L"hidden column\*(R" for which SQLite
is not able to automatically infer the proper type. So the following
doesn't work:
.PP
.Vb 2
\&  my $sth = $dbh\->prepare("INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)");
\&  $sth\->execute(2, \*(Aqaa\*(Aq, \*(Aqbb\*(Aq); # constraint error
.Ve
.PP
but it works with an explicit cast:
.PP
.Vb 3
\&  my $sql = "INSERT INTO t1(docid, a, b) VALUES(CAST(? AS INTEGER), ?, ?)";
\&  my $sth = $dbh\->prepare($sql);
\&  $sth\->execute(2, \*(Aqaa\*(Aq, \*(Aqbb\*(Aq);
.Ve
.PP
or with an explicitly typed \*(L"bind_param\*(R" in \s-1DBI\s0:
.PP
.Vb 7
\&  use DBI qw/SQL_INTEGER/;
\&  my $sql = "INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)";
\&  my $sth = $dbh\->prepare($sql);
\&  $sth\->bind_param(1, 2, SQL_INTEGER);
\&  $sth\->bind_param(2, "aa");
\&  $sth\->bind_param(3, "bb");
\&  $sth\->execute();
.Ve
.PP
It is not possible to \s-1UPDATE\s0 or \s-1DELETE\s0 a row stored in a contentless
\&\s-1FTS4\s0 table. Attempting to do so is an error.
.PP
Contentless \s-1FTS4\s0 tables also support \s-1SELECT\s0 statements. However, it is
an error to attempt to retrieve the value of any table column other
than the docid column. The auxiliary function \f(CW\*(C`matchinfo()\*(C'\fR may be
used, but \f(CW\*(C`snippet()\*(C'\fR and \f(CW\*(C`offsets()\*(C'\fR may not, so if such
functionality is needed, it has to be directly programmed within the
Perl application.
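.PP
One possible arrangement, sketched below, keeps the location of each
document in an ordinary table and reuses its key as the docid. The
table names \f(CW\*(C`files\*(C'\fR and \f(CW\*(C`files_idx\*(C'\fR and the helper
\&\f(CW\*(C`index_file\*(C'\fR are hypothetical, and \f(CW$dbh\fR is assumed to be a
connected handle with \f(CW\*(C`RaiseError\*(C'\fR set:
.PP
.Vb 4
\&  $dbh\->do("CREATE TABLE files(id INTEGER PRIMARY KEY, path TEXT)");
\&  $dbh\->do(q{CREATE VIRTUAL TABLE files_idx USING fts4(content="", body)});
\&
\&  my $add_file = $dbh\->prepare("INSERT INTO files(path) VALUES (?)");
\&  my $add_idx  = $dbh\->prepare(
\&    "INSERT INTO files_idx(docid, body) VALUES (CAST(? AS INTEGER), ?)");
\&
\&  # Record the path, then index the text under the same integer key.
\&  sub index_file {
\&    my ($path, $text) = @_;
\&    $add_file\->execute($path);
\&    my $id = $dbh\->last_insert_id(undef, undef, "files", "id");
\&    $add_idx\->execute($id, $text);
\&  }
\&
\&  # Resolve the docids of matching documents back to their paths.
\&  my $paths = $dbh\->selectcol_arrayref(q{
\&    SELECT path FROM files
\&     WHERE id IN (SELECT docid FROM files_idx WHERE body MATCH ?)
\&  }, undef, "needle");
.Ve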
.SH "AUTHOR"
.IX Header "AUTHOR"
Laurent Dami <dami@cpan.org>
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright 2014 Laurent Dami.
.PP
Some parts borrowed from the <http://sqlite.org> documentation, copyright 2014.
.PP
This documentation is in the public domain; you can redistribute
it and/or modify it under the same terms as Perl itself.