Lowercase, Tokenize, Stem, and Stopword Text

Normalizes free-form text from selected columns within a given table. For example, with "|" chosen as the character separating individual items of the list, the input text "Emergence of Scaling in Random Networks" becomes "emerg|scale|random|network".

From this example you can follow the four normalization steps:

  1. Lowercase: The example text becomes "emergence of scaling in random networks".
  2. Tokenize: The text blob is split into a list of individual words. The example text becomes "emergence|of|scaling|in|random|networks".
  3. Stem: Common or low-content prefixes and suffixes are removed to identify the core concept. The example text becomes "emerg|of|scale|in|random|network".
  4. Stopword: Low-content tokens like "of" and "in" are removed (see the complete stopword list). The example text becomes "emerg|scale|random|network".
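The four steps above can be sketched in Python. This is a minimal illustration, not the tool's actual implementation; in particular, the toy suffix rules below only cover the words in this example, whereas the real stemming step uses a full stemming algorithm:

```python
def normalize(text, stopwords, sep="|"):
    """Apply the four normalization steps to a text blob."""
    # 1. Lowercase
    text = text.lower()
    # 2. Tokenize: split the blob into individual words
    tokens = text.split()
    # 3. Stem: toy suffix rules covering only this example;
    #    a real stemmer handles far more cases
    rules = [("ence", ""), ("ing", "e"), ("s", "")]
    def stem(word):
        for suffix, replacement in rules:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)] + replacement
        return word
    tokens = [stem(t) for t in tokens]
    # 4. Stopword: drop low-content tokens
    tokens = [t for t in tokens if t not in stopwords]
    return sep.join(tokens)

print(normalize("Emergence of Scaling in Random Networks", {"of", "in"}))
# emerg|scale|random|network
```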
  • Stopword List: The plain-text file containing the stopwords to use, one stopword per line. By default it points to the included stopword list; if an invalid file path is specified, the included stopword list is used as a fallback.
  • New Separator: The character that will separate items in the output lists of tokens.
  • Each textual column of the table can be individually selected for, or excluded from, normalization.
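One way the Stopword List parameter's fallback behavior might be handled is sketched below — one stopword per line, with a built-in list used when the path is invalid. The default words shown are illustrative assumptions, not the tool's actual stopword list:

```python
from pathlib import Path

# Illustrative default list; the actual tool ships a much larger one.
DEFAULT_STOPWORDS = {"of", "in", "the", "a", "and"}

def load_stopwords(path):
    """Read one stopword per line; fall back to the default
    list if the given path is not a readable file."""
    p = Path(path)
    if not p.is_file():
        return set(DEFAULT_STOPWORDS)
    return {line.strip() for line in p.read_text().splitlines() if line.strip()}
```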

This algorithm can prepare the text in a table for Burst Detection.
