English»Software»Natural Language Processing & Information Extraction»Document Parsers & Cleaners

Parsers for PDF, RTF, DOC, HTML, and other document formats.

A tool for extracting plain text from Wikipedia dumps - seems to be one of the best text extractors.

Alternative parsers for Wikipedia dumps

Apache PDFBox - Java PDF Library

Boilerpipe: remove boilerplate HTML code

dkpro-jwpl - (DKPro Java Wikipedia Library) is a free, Java-based application programming interface that facilitates access to all information in Wikipedia.

grobid - Machine learning software for extracting information from scholarly documents (highly recommended).

Html cleaner

Html Tidy

JSoup: Java HTML Parser

Multivalent - Multivalent works on numerous document types, including PDF, HTML, DVI, UNIX man pages and more. It is especially useful for PDF, because it tries to paste together word fragments into whole words. Multivalent also integrates with Lucene.

PDFtoHTML - PDFtoHTML is a utility which converts PDF files into HTML and XML formats. It is based on XPDF.

python-ftfy - ftfy: fixes broken Unicode text (mojibake)

science-parse - AI2 PDF parser for scientific articles.

scrapinghub - A collection of utilities for web scraping, parsing, and cleaning.

Tika - The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

wikixmlj - easy access to Wikipedia XML dumps.

Xpdf: PDF utilities, including pdftotext
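
To give a feel for what the HTML cleaners listed here do at their simplest, below is a minimal sketch using only the Python standard library. It is not the algorithm of any tool in this directory: the names `TextExtractor` and `html_to_text` are hypothetical, and real cleaners such as Boilerpipe or JSoup handle malformed markup, boilerplate detection, and encodings far more carefully. The sketch only drops tags, skips the contents of `<script>` and `<style>`, and collapses whitespace.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because chunks are joined with a space, a word split by inline markup (for example `wor<b>ld</b>`) would come out as two tokens; for anything beyond quick experiments, one of the dedicated libraries above is likely the better choice.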