libextractor - Documentation
Further documentation

This documentation covers the major aspects of libextractor. The man pages for extract and libextractor are also online. You can browse the Doxygen-generated documentation of the source code.

An article describing libextractor was published in Linux Journal and is available online.

Copyright and Contributions

libextractor is released under the GNU General Public License (GPL). All contributions must therefore be released under the GPL or a compatible license.

Installation

The simplest way to install libextractor is to use one of the binary packages that are available online for many distributions. Note that under Debian, the extract tool is in a separate package, extract, and the headers required to compile other applications against libextractor are in libextractor0-dev. Thus, under Debian, you should use:

# apt-get install libextractor0-dev extract

If you want to compile libextractor from source, you will need an unusual amount of memory: 256 MB of system memory is roughly the minimum, since gcc will take about 200 MB to compile one of the plugins. Otherwise, compiling by hand follows the usual sequence:

$ tar xzvf libextractor.x.x.x.tar.gz
$ cd libextractor.x.x.x
$ ./configure
$ make
# make install

Note that you need glib 2.4, libvorbisfile and zlib (including headers) in order to compile all of the plugins.

Using the extract tool

After installing libextractor, the extract tool can be used to obtain meta-data from documents. By default, the extract tool uses the canonical set of plugins, which consists of all file-format-specific plugins supported by the current version of libextractor together with the mime-type detection plugin. If you are a BibTeX user, the option -b is likely to come in handy: it automatically creates BibTeX entries from documents that have been properly equipped with meta-data:

$ wget -q http://www.copyright.gov/legislation/dmca.pdf
$ extract -b ~/dmca.pdf
% BiBTeX file
@misc{ unite2001the_d,
title = "The Digital Millennium Copyright Act of 1998",
author = "United States Copyright Office - jmf",
note = "digital millennium copyright act circumvention...",
year = "2001",
month = "10",
key = "Copyright Office Summary of the DMCA",
pages = "18"
}
Another interesting option is -B LANG. This option loads one of the language-specific (but format-agnostic) plugins. These plugins attempt to find plaintext in a document by matching strings in the document against a dictionary. If the need for 200 MB of memory to compile libextractor seems mysterious, the answer lies in these plugins. To make the dictionary search fast, a Bloom filter is created that allows fast probabilistic matching; gcc finds the resulting data structure a bit hard to swallow.

The option -B is useful for formats that are undocumented or currently unsupported, and for full-text search. Note that the printable plugins typically print the entire text of the document in order. The languages supported at the moment are Danish (da), German (de), English (en), Spanish (es), Italian (it) and Norwegian (no). Supporting other languages is merely a question of adding (free) dictionaries in an appropriate character set. Further options are described in the extract manpage (man 1 extract).
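To make the idea of probabilistic dictionary matching more concrete, here is a minimal, self-contained sketch of a Bloom filter in C. It is not the code used by the libextractor plugins: the names BloomFilter, bloom_init, bloom_add, bloom_maybe_contains, FILTER_BITS and NUM_HASHES, the filter size, and the choice of a seeded FNV-1a hash are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parameters; a real dictionary would need a far larger filter. */
#define FILTER_BITS 8192u
#define NUM_HASHES 3u

typedef struct {
  uint8_t bits[FILTER_BITS / 8];
} BloomFilter;

/* Seeded FNV-1a hash, reduced to a bit position inside the filter. */
static uint32_t bit_index(const char *word, uint32_t round) {
  uint32_t h = 2166136261u ^ (round * 0x9e3779b9u);
  for (const unsigned char *p = (const unsigned char *)word; *p != '\0'; p++) {
    h ^= *p;
    h *= 16777619u;
  }
  return h % FILTER_BITS;
}

/* Clear all bits: the filter then represents the empty dictionary. */
static void bloom_init(BloomFilter *f) {
  assert(f != NULL);
  memset(f->bits, 0, sizeof f->bits);
}

/* Add a dictionary word by setting NUM_HASHES bits. */
static void bloom_add(BloomFilter *f, const char *word) {
  assert(f != NULL && word != NULL);
  for (uint32_t i = 0; i < NUM_HASHES; i++) {
    uint32_t bit = bit_index(word, i);
    f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
  }
}

/* Returns 0 if the word is definitely not in the dictionary and 1 if it
   probably is (false positives are possible, false negatives are not). */
static int bloom_maybe_contains(const BloomFilter *f, const char *word) {
  assert(f != NULL && word != NULL);
  for (uint32_t i = 0; i < NUM_HASHES; i++) {
    uint32_t bit = bit_index(word, i);
    if ((f->bits[bit / 8] & (1u << (bit % 8))) == 0)
      return 0;
  }
  return 1;
}
```

A caller would fill the filter once with bloom_add for every dictionary word and then test each candidate string from the document with bloom_maybe_contains. A lookup costs only a few hash computations regardless of dictionary size, at the price of occasional false positives.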
Examples:

$ extract libextractor-0.1.3-1.src.rpm
os - linux
resource-identifier - http://ovmj.org/libextractor/
group - System Environment/Libraries
license - LGPL
copyright - LGPL
size - 251545
build-host - wedge.cs.purdue.edu
creation date - Wed Dec 25 07:50:07 2002
description - libextractor is a simple library...
summary - keyword extraction library
release - 1
version - 0.1.3
title - libextractor
unknown - SOURCE RPM 3.0
mimetype - application/x-rpm

$ extract extractor_logo.png
unknown - The libextractor logo
mimetype - image/png

The following is the output of extract for a Winword document using the plaintext extractors:

$ wget -q http://www.bayern.de/HDBG/polges.doc
$ extract -B de polges.doc | head -n 4
unknown - FEE Politische Geschichte Bayerns Herausgegeben vom Haus der Geschichte als Heft der zur Geschichte und Kultur Redaktion Manfred Bearbeitung Otto Copyright Haus der Geschichte München Gestaltung fürs Internet Rudolf Inhalt im.
unknown - und das Deutsche Reich.
unknown - und seine.
unknown - Henker im Zeitalter von Reformation und Gegenreformation.

You can also use the demo page to upload a file and run extract on it.

Using the libextractor library

The listing below shows the code of a minimal program that uses libextractor. Compiling it requires passing the option -lextractor to gcc. The EXTRACTOR_KeywordList is a simple linked list containing a keyword and a keyword type. For details and additional functions for loading plugins and manipulating the keyword list, see the libextractor manpage (man 3 libextractor). Java programmers should note that a Java class that uses JNI to communicate with libextractor is also available. Python programmers will find that libextractor (since 0.5.0) can also be used from Python: just import Extractor.

#include <stdio.h>
#include <extractor.h>

int main(int argc, char * argv[]) {
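  /* Expect exactly one argument: the file to inspect. */
  if (argc != 2)
    return 1;
  /* Load the default set of plugins. */
  EXTRACTOR_ExtractorList *extractors
    = EXTRACTOR_loadDefaultLibraries();
  /* Run all loaded plugins on the file. */
  EXTRACTOR_KeywordList *keywords
    = EXTRACTOR_getKeywords(extractors, argv[1]);
  EXTRACTOR_printKeywords(stdout, keywords);
  /* Release the keyword list and unload the plugins. */
  EXTRACTOR_freeKeywords(keywords);
  EXTRACTOR_removeAll(extractors);
  return 0;
}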

Current Plugins

HTML, PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF (NES Sound Format), SID, OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, FLV, REAL, RIFF (AVI), MPEG, QT and ASF.

Writing new Plugins

The most complicated part of writing a new plugin for libextractor is writing the actual parser for a specific format. Nevertheless, the basic pattern is always the same. The plugin library must be called libextractor_XXX.so, where XXX denotes the file format supported by the plugin. The library must export a function libextractor_XXX_extract with the following signature:

struct EXTRACTOR_Keywords *
libextractor_XXX_extract (char * filename,
                          char * data,
                          size_t size,
                          struct EXTRACTOR_Keywords * prev,
                          const char * options);

The argument filename specifies the name of the file that is being processed. data is a pointer to the (typically mmapped) contents of the file, and size is the file size. Most plugins do not make use of the filename and just parse data directly, starting by verifying that the header of the data matches the specific format. prev is the list of keywords that have been extracted so far by other plugins for the file. The function is expected to return an updated list of keywords. The plugin is expected to convert the keywords into the UTF-8 character set. If the format does not match the expectations of the plugin, prev is returned unchanged. Most plugins use a function like addKeyword to extend the list:

static void addKeyword
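(struct EXTRACTOR_Keywords ** list,
 char * keyword,
 EXTRACTOR_KeywordType type)
{
  EXTRACTOR_KeywordList * next;

  /* Allocate a new node and prepend it to the list. */
  next = malloc(sizeof(EXTRACTOR_KeywordList));
  next->next = *list;
  /* The node takes ownership of keyword; the string is not copied here. */
  next->keyword = keyword;
  next->keywordType = type;
  *list = next;
}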

A typical use of addKeyword is to add the mime-type once the file format has been established. For example, the JPEG extractor checks the first bytes of the JPEG header and then either aborts or claims the file to be a JPEG. Note that the keyword passed to addKeyword must be a heap-allocated copy (for example, one made with strdup), since the string will be deallocated later, typically in EXTRACTOR_freeKeywords(). A list of supported keyword classifications (such as EXTRACTOR_MIMETYPE for the mime-type) can be found in the extractor.h header file.
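
The pattern described above can be sketched as follows. This is an illustrative, simplified plugin and not the actual JPEG extractor shipped with libextractor. The type definitions at the top are stand-ins so that the sketch compiles without the library (a real plugin would include extractor.h instead, and the enum value here is a placeholder). The check is reduced to the two-byte start-of-image marker (0xFF 0xD8) that begins every JPEG file, and the NULL checks are additions for this sketch.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in declarations (assumptions); a real plugin would use
   #include <extractor.h> instead of defining these itself. */
typedef enum { EXTRACTOR_MIMETYPE } EXTRACTOR_KeywordType;

typedef struct EXTRACTOR_Keywords {
  char *keyword;
  EXTRACTOR_KeywordType keywordType;
  struct EXTRACTOR_Keywords *next;
} EXTRACTOR_KeywordList;

/* Prepend a keyword; the list takes ownership of the string. */
static void addKeyword(struct EXTRACTOR_Keywords **list,
                       char *keyword,
                       EXTRACTOR_KeywordType type)
{
  EXTRACTOR_KeywordList *next;

  assert(list != NULL);
  next = malloc(sizeof(EXTRACTOR_KeywordList));
  if (next == NULL) {
    free(keyword);  /* out of memory: drop the keyword, keep the list */
    return;
  }
  next->next = *list;
  next->keyword = keyword;
  next->keywordType = type;
  *list = next;
}

/* Entry point following the libextractor_XXX_extract naming pattern. */
struct EXTRACTOR_Keywords *
libextractor_jpeg_extract(char *filename,
                          char *data,
                          size_t size,
                          struct EXTRACTOR_Keywords *prev,
                          const char *options)
{
  const unsigned char *bytes = (const unsigned char *)data;
  char *mime;

  (void)filename;  /* unused, as in most plugins */
  (void)options;

  /* Not a JPEG: return the list untouched. */
  if (data == NULL || size < 2 || bytes[0] != 0xFF || bytes[1] != 0xD8)
    return prev;

  /* strdup gives the list a heap copy it can later free. */
  mime = strdup("image/jpeg");
  if (mime != NULL)
    addKeyword(&prev, mime, EXTRACTOR_MIMETYPE);
  return prev;
}
```

Passing a string literal to addKeyword directly would lead to an invalid free when the list is released, which is why the sketch hands over a strdup copy.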