Indiana University, University of Rome, Yale University, Leiden University, International Center for Theoretical Physics, University of Paris-Sud
Informatics, Complex Network Science and System Research, Physics, Statistics, Epidemics
A scholarly dataset can be understood as a discrete time series: a sequence of events or observations ordered in one dimension, time. Observations exist for regularly spaced intervals, e.g., each month or year.
The burst detection algorithm (see Section 4.6.1 Burst Detection) identifies sudden increases or "bursts" in the frequency-of-use of character strings over time. This algorithm identifies topics, terms, or concepts important to the events being studied that increased in usage, were more active for a period of time, and then faded away.
An analysis of publications authored or co-authored by Alessandro Vespignani from 1990 to 2006 will be used to illustrate the "burst" concept. Alessandro Vespignani is an Italian physicist and Professor of Informatics and Cognitive Science at Indiana University, Bloomington. His publications show a change in research focus, from Physics to Complex Networks, beginning in 2001.
Load Alessandro Vespignani's ISI publication history using 'File > Load' and following this path: 'yoursci2directory/sampledata/scientometrics/isi/AlessandroVespignani.isi'.
This analysis will detect the "bursty" terms used in the titles of the papers in the dataset. Since the burst detection algorithm is case-sensitive, it is necessary to normalize the field to be analyzed before running the algorithm. Select the table "101 Unique ISI Records" and run 'Preprocessing > Topical > Lowercase, Tokenize, Stem, and Stopword Text.' Check the "Title" box to indicate that you want to normalize this field:
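To make the normalization step concrete, here is a minimal Python sketch of what lowercasing, tokenizing, and stopword removal do to a title. It is an illustration only, not the Sci2 implementation: the stopword list, the "|" separator, and the example title are assumptions, and stemming is left out because it requires a full stemmer.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "with"}

def normalize_title(title, separator="|"):
    """Lowercase a title, split it into word tokens, and drop stopwords.

    Stemming is intentionally omitted; the separator is an assumed choice.
    """
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return separator.join(kept)

# Made-up title, used only to show the effect of the three steps.
print(normalize_title("The Growth of Weighted Networks"))
```

After this step, differently cased spellings of the same word yield the same token, which is what a case-sensitive burst detection run needs.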
Select the resulting "with normalized Title" table in the Data Manager and run 'Analysis > Topical > Burst Detection' with the following parameters:
The "Gamma" parameter is the value to which state transition costs are proportional. It controls how easily the automaton can change states: the higher the "Gamma" value, the shorter the list of bursts generated.
The "Density Scaling" parameter determines how much 'more bursty' each level is than the previous one. The higher the scaling value, the more active (bursty) the events are at each successive level.
The "Bursting States" parameter determines how many bursting states there are beyond the non-bursting state. A value of i bursting states corresponds to i + 1 automaton states.
The "Date Column" parameter is the name of the column containing the date/time at which the events or topics occur.
The "Date Format" specifies how the date column will be interpreted as a date/time. See http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html for details.
The "Text Column" parameter is the name of the column whose values (tokens separated by a delimiter) are analyzed for bursts.
The "Text Separator" parameter specifies the separator that was used to delimit the tokens in the text column.
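The roles of "Gamma" and "Density Scaling" are easier to see in code. The following is a minimal sketch of a two-state burst automaton (one bursting state) in the style of Kleinberg's batched burst model; it is not the Sci2 implementation, and the function name, cost formulas, and example counts are assumptions made for illustration. It also assumes the word occurs at least once and not in every record.

```python
import math

def detect_bursts(relevant, totals, gamma=1.0, scaling=2.0):
    """Return the cheapest state (0 = baseline, 1 = burst) for each time step.

    relevant[t] -- number of records at step t that contain the word
    totals[t]   -- number of all records at step t
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(totals)      # baseline rate of the word
    p1 = min(p0 * scaling, 0.9999)        # elevated rate in the bursting state
    rates = (p0, p1)
    up_cost = gamma * math.log(n)         # price of entering the bursting state

    def fit_cost(p, r, d):
        # Negative log-likelihood of seeing r hits in d records at rate p.
        return -(math.log(math.comb(d, r)) + r * math.log(p)
                 + (d - r) * math.log(1 - p))

    cost = [0.0, math.inf]                # the automaton starts in state 0
    back = []                             # back-pointers for the Viterbi pass
    for r, d in zip(relevant, totals):
        new_cost, pointers = [], []
        for state in (0, 1):
            # Moving up (0 -> 1) costs up_cost; staying or moving down is free.
            options = [cost[prev] + (up_cost if state > prev else 0.0)
                       for prev in (0, 1)]
            best_prev = 0 if options[0] <= options[1] else 1
            pointers.append(best_prev)
            new_cost.append(options[best_prev] + fit_cost(rates[state], r, d))
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards from the last time step.
    state = 0 if cost[0] <= cost[1] else 1
    path = []
    for pointers in reversed(back):
        path.append(state)
        state = pointers[state]
    path.reverse()
    return path

# Made-up yearly counts: the word appears in most records in years 3 to 5.
hits = [1, 1, 1, 8, 9, 8, 1, 1]
size = [10] * 8
print(detect_bursts(hits, size, gamma=1.0))
print(detect_bursts(hits, size, gamma=10.0))
```

With the low gamma the three busy years are labeled as a burst; with the high gamma the cost of entering the bursting state outweighs the better fit and no burst is reported, which mirrors the statement that a higher "Gamma" yields fewer bursts.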
View the file "Burst detection analysis (Publication Year, Title): maximum burst level 1":
In this table, there are six columns: "Word," "Level," "Weight," "Length," "Start," and "End."
The "Word" field identifies the specific character string that was detected as a "burst."
The "Level" field is the burst level of this burst. The higher the burst level, the more frequently the event or topic occurs.
The "Weight" field is the weight of this burst over its "Length." A higher weight can result from a longer "Length," a higher "Level," or both.
The "Length" field indicates how long the burst lasted (in units of the selected time parameter). It is computed as (End - Start + 1).
The "Start" field identifies when the burst began (again, according to the specified time parameter).
The "End" field indicates when the burst stopped. A null value in the "End" field indicates that the burst lasted until the last date present in the dataset.
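As a small worked example of how "Start," "End," and "Length" relate, the sketch below counts the time steps from Start through End inclusive and treats a null End as the last date in the dataset. The function name and the example years are hypothetical.

```python
def burst_length(start, end, last_date):
    """Length of a burst in time steps; a missing end means it ran to last_date."""
    effective_end = last_date if end is None else end
    return effective_end - start + 1

# Hypothetical bursts in a dataset whose last publication year is 2006.
print(burst_length(1990, 1992, 2006))  # burst with an explicit end year
print(burst_length(2001, None, 2006))  # burst still active at the end of the data
```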
The resulting analysis indicates a change in the research focus of Alessandro Vespignani for publications beginning in 2001. For example, the bursting terms "fractal," "growth," "transform," and "fix" starting in 1990 are related to Vespignani's Ph.D., entitled "Fractal Growth and Self-Organized Criticality" in Physics. Other bursts also related to Physics follow these, such as "sandpil." After 2001, bursting terms like "complex," "network," "free," and "weight" appear, signifying a change in Vespignani's research area from Physics to Complex Networks, with a larger number of publications on topics like "weighted networks" and "scale-free networks."
Select the table 'with normalized Title' in the Data Manager and run 'Analysis > Topical > Burst Detection' with the following parameters:
Notice that the value for the gamma parameter is now set to 0.5. The parameter gamma controls the ease with which the automaton can change states. With a smaller gamma value, more bursts will be generated. Running the algorithm with these parameters will generate a new table named "Burst detection analysis (Publication Year, Title): maximum burst level 1.2" in the Data Manager.
As expected, a larger number of bursts appear, and the new bursts have a smaller weight than those depicted in the first graph. These smaller, more numerous bursting terms permit a more detailed view of the dataset and allow the identification of trends. The "protein" burst starting in 2003, for example, indicates the year in which Alessandro Vespignani started to work with "protein-protein interaction networks," while the burst "epidem," also from 2001, is related to the application of complex networks to the analysis of epidemic phenomena in biological networks.
Visualizing Burst Detection in Excel
In the Sci2 Tool, the algorithm can be found under 'Analysis > Temporal > Burst Detection'. As the algorithm itself is case sensitive, care must be taken if the user desires 'KOREA' and 'korea' and 'Korea' to be identified as the same word.
As the Garfield ISI data is very different in character from the rest, it is left out of the burst analysis done here. One particular difference is the absence of ISI keywords from most of the works in the Garfield dataset.
Use 'File > Load' to load ThreeNetSciResearchers.isi, a file that contains all of Wasserman's, Vespignani's and Barabási's ISI records and is provided as a sample dataset in 'yoursci2directory/sampledata/scientometrics/isi/ThreeNetSciResearchers.isi'. The result is two new tables in the Data Manager. The first is a table with all ISI records. The second is a derived (indented) table with unique ISI records named '262 Unique ISI Records'. In the latter table, ISI records with identical ID numbers (UT field) are merged, and only the record with the higher citation count (CT value) is kept. Select the '262 Unique ISI Records' table and run 'Analysis > Temporal > Burst Detection' using the parameters:
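The merging rule for the unique-records table can be sketched in a few lines of Python. This is an illustration of the rule as described, not the Sci2 code; the record layout (dictionaries with "UT" and "citations" keys) and the sample values are assumptions.

```python
def unique_records(records):
    """Keep one record per UT value, preferring the higher citation count."""
    best = {}
    for record in records:
        key = record["UT"]
        if key not in best or record["citations"] > best[key]["citations"]:
            best[key] = record
    return list(best.values())

# Hypothetical records: the first two share an ID and differ in citation count.
sample = [
    {"UT": "ID-1", "citations": 12},
    {"UT": "ID-1", "citations": 15},
    {"UT": "ID-2", "citations": 3},
]
print(unique_records(sample))
```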
A third table (derived from '262 Unique ISI Records') labeled 'Burst detection analysis ...' will appear in the Data Manager. On a PC running Windows, right click on this table and select view to see the data in Excel. On a Mac or a Linux system, right click and save the file, then open it using the spreadsheet program of your choice. The table has six columns: the bursting words (here author names), the length of the burst, the burst weight, the burst strength, and the burst start and end years. Note that words can burst multiple times. If they do, then the burst 'weight' indicates how much support there is for the burst above the previous bursting level, while 'strength' indicates how much support there is for the burst over the non-bursting baseline. Since the burst detection algorithm was run with 'bursting state = 1', i.e., modeled only one burst per word, the burst weight is identical to the burst strength in this output.
To generate a visual depiction of the bursts in MS Excel perform the following steps:
1. Sort the data ascending by burst start year.
2. Add column headers for all years, i.e., enter the first start year in G1, here 1980. Continue, e.g., using the formula '=G1+1', until the highest burst end year, here 2004 in cell AE1.
3. In the resulting word by burst year matrix, select the upper left blank cell (G2) and select 'Conditional Formatting' from the ribbon. Then select 'Data Bars > More Rules > Use a formula to determine which cells to format.' To color cells for years with a burst weight value of more than or equal to 10 red, and cells with a higher value dark red, use the following formulas and format patterns:
Select 'OK' and then repeat step three, using the formula below:
4. Once both formatting rules have been established, select 'Conditional Formatting > Manage Rules', highlight the first formatting rule and move to the top of the list:
5. Make sure both formatting rules are selected and apply them to the current selection. Apply the format to all cells in the word by year matrix by dragging the box around cell G2 to highlight all cells in the matrix. The result for the given example is shown in Figure 5.33.
Figure 5.32.1: Visualizing burst results in MS Excel
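For readers without Excel, the same word by year matrix can be approximated in plain text. The Python sketch below sorts bursts by start year and marks each active year, using one marker for bursts at or above a weight threshold and another for weaker ones. The function name, the markers, the threshold of 10, and the example bursts are all assumptions for illustration.

```python
def burst_matrix(bursts, first_year, last_year, threshold=10.0):
    """Render one text row per burst, marking each year the burst was active.

    bursts -- iterable of (word, weight, start_year, end_year_or_None)
    """
    years = range(first_year, last_year + 1)
    rows = []
    # Step 1 of the Excel procedure: sort ascending by burst start year.
    for word, weight, start, end in sorted(bursts, key=lambda b: b[2]):
        end = last_year if end is None else end   # null end: runs to the last year
        mark = "#" if weight >= threshold else "+"
        cells = "".join(mark if start <= year <= end else "." for year in years)
        rows.append(f"{word:<10}{cells}")
    return rows

# Hypothetical bursts, not taken from the sample dataset.
example = [("beta", 4.5, 1983, None), ("alpha", 12.0, 1980, 1982)]
for row in burst_matrix(example, 1980, 1985):
    print(row)
```

Each column corresponds to one year header (G1 onward in the spreadsheet), and the two markers play the role of the two conditional-formatting colors.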