A scholarly dataset can be understood as a discrete time series, i.e., a sequence of events or observations ordered in one dimension: time. Observations (e.g., papers) come into existence at regularly spaced intervals (e.g., each week, month, issue, volume, or year).
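As a minimal sketch of this idea, the following Python snippet turns a list of publication years into a regularly spaced series of counts. The function name `yearly_counts` and the choice of one-year intervals are illustrative assumptions; they are not part of Sci2.

```python
from collections import Counter

def yearly_counts(years):
    """Return (year, number of papers) pairs for every year in the observed range.

    Years without any papers are kept with a count of 0 so that the
    resulting series stays regularly spaced.
    """
    counts = Counter(years)
    return [(year, counts.get(year, 0)) for year in range(min(years), max(years) + 1)]

# Hypothetical publication years of five papers.
series = yearly_counts([2001, 2001, 2003, 2004, 2004])
print(series)
```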
Kleinberg's burst detection algorithm (Kleinberg, 2002) identifies sudden increases in the frequency of words. Rather than using simple frequencies of the occurrences of words, the algorithm employs a probabilistic automaton whose states correspond to increasing frequencies of individual words. State transitions correspond to points in time around which the frequency of the word changes significantly. The algorithm generates a list of the word bursts in the document stream, ranked according to the burst weight, together with the intervals of time in which these bursts occurred. The burst weight indicates the intensity of the burst, i.e., how large the change in word frequency that triggered the burst was. The user must choose whether to detect bursts in author names, journal names, country names, references, ISI keywords, or terms used in the title and/or abstract of a paper. Burst detection can thus serve as a means of identifying relevant topics, terms, or concepts that increased in usage, were more active for a period of time, and then faded away.
Kleinberg's burst detection algorithm has four important parameters: the gamma value, the first ratio, the general ratio, and the number of bursting states. The gamma parameter controls the ease with which the automaton can change states. The higher the gamma value, the harder it is to enter a bursting state and the shorter the list of bursts generated. The first ratio and general ratio control how great the change in the frequency of a word must be to be considered a burst. Usually, only one bursting state is used, since one bursting state is sufficient to indicate whether or not a burst occurred for a specific character string within a specific time period. However, if the user wishes to identify bursts inside bursts, s/he must use more than one bursting state to capture such a hierarchical structure. In Sci2, default values are provided for all parameters.
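To make the roles of these parameters more concrete, the following Python sketch implements a simplified automaton with one base state and one bursting state. It is an illustration only and not the Sci2 implementation: the function name `detect_bursts`, the single `ratio` parameter (standing in for the first ratio), the transition cost of gamma times the logarithm of the number of time slices, and the default values are all assumptions made for this example.

```python
import math

def detect_bursts(relevant, totals, gamma=1.0, ratio=2.0):
    """Two-state burst detection sketch (one base state, one bursting state).

    relevant[t] -- number of documents in time slice t that contain the word
    totals[t]   -- number of all documents in time slice t
    Returns (start, end, weight) tuples sorted by descending burst weight.
    """
    n = len(relevant)
    hits, docs = sum(relevant), sum(totals)
    if hits == 0 or hits >= docs:
        return []                        # no usable base rate, so no bursts
    p0 = hits / docs                     # base-state rate
    p1 = min(p0 * ratio, 0.9999)         # bursting-state rate (assumes ratio > 1)
    up_cost = gamma * math.log(n)        # assumed cost of entering the bursting state

    def fit_cost(p, r, d):
        # Negative log-likelihood of r hits among d documents at rate p.
        # The binomial coefficient is omitted because it is equal for both states.
        return -(r * math.log(p) + (d - r) * math.log(1 - p))

    # Viterbi pass: cheapest state sequence, starting in the base state.
    cost = [0.0, float("inf")]
    back = []
    for r, d in zip(relevant, totals):
        prev0 = 0 if cost[0] <= cost[1] else 1            # moving down is free
        prev1 = 0 if cost[0] + up_cost < cost[1] else 1   # moving up costs up_cost
        new0 = min(cost[0], cost[1]) + fit_cost(p0, r, d)
        new1 = min(cost[0] + up_cost, cost[1]) + fit_cost(p1, r, d)
        back.append((prev0, prev1))
        cost = [new0, new1]

    # Backtrack to recover the state of each time slice.
    states = [0] * n
    state = 0 if cost[0] <= cost[1] else 1
    for t in range(n - 1, -1, -1):
        states[t] = state
        state = back[t][state]

    # Collect maximal runs of the bursting state; the weight is the total
    # improvement in fit gained by being in the bursting state.
    bursts, t = [], 0
    while t < n:
        if states[t] == 1:
            start, weight = t, 0.0
            while t < n and states[t] == 1:
                weight += fit_cost(p0, relevant[t], totals[t]) - fit_cost(p1, relevant[t], totals[t])
                t += 1
            bursts.append((start, t - 1, weight))
        else:
            t += 1
    return sorted(bursts, key=lambda b: -b[2])

# Hypothetical counts: the word appears far more often in slices 3 to 5.
relevant = [1, 1, 1, 10, 12, 11, 1, 1]
totals = [100] * 8
print(detect_bursts(relevant, totals, gamma=1.0))
print(detect_bursts(relevant, totals, gamma=10.0))
```

With these made-up counts, the low gamma value reports a single burst covering slices 3 to 5, while the high gamma value makes entering the bursting state too expensive and the list of bursts is empty.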
The Sci2 tool currently uses this algorithm in 'Analysis > Temporal > Burst Detection'.
Because the algorithm itself is case-sensitive, care must be taken if the user wants 'KOREA', 'korea', and 'Korea' to be identified as the same word. It is recommended that the user normalize the target data before applying the burst algorithm. The normalization process separates text into word tokens, normalizes word tokens to lower case, removes "s" from the end of words, removes dots from acronyms, deletes stop words, and applies the English Snowball stemmer (http://snowball.tartarus.org/algorithms/english/stemmer.html). The normalization of an entry column can be done using the menu 'Preprocessing > Topical > Lowercase, Tokenize, Stem, and Stopword Text'.
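The normalization steps can be approximated with a few lines of Python. This is a rough sketch rather than the Sci2 preprocessing code: the function name `normalize`, the tiny stop word list, and the rule of stripping a trailing "s" only from words longer than three characters are assumptions for illustration, and the Snowball stemming step is left out so that the example needs only the standard library.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def normalize(text):
    """Tokenize, lowercase, strip acronym dots and trailing 's', drop stop words."""
    tokens = []
    for raw in text.split():
        token = raw.lower().replace(".", "")      # lowercase; "U.S.A." becomes "usa"
        token = re.sub(r"[^a-z0-9]", "", token)   # drop any remaining punctuation
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]                    # crude plural removal
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens

print(normalize("KOREA Korea korea"))
print(normalize("Networks in the U.S.A."))
```

After normalization, the three spellings of "Korea" map to the same token, so the burst algorithm counts them as one word.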
For an example of how to use the burst detection algorithm, see Section 5.2.5 Burst Detection in Physics and Complex Networks (ISI Data).

4.6.2 Slice Table by Time
