Indiana University, University of Rome, Yale University, Leiden University, International Center for Theoretical Physics, University of Paris-Sud
Informatics, Complex Network Science and System Research, Physics, Statistics, Epidemics
18.104.22.168 Burst Detection
A scholarly dataset can be understood as a discrete time series: in other words, a sequence of events or observations ordered along one dimension, time. Observations exist for regularly spaced intervals, e.g., each month or year.
The burst detection algorithm (see Section 4.6.1 Burst Detection) identifies sudden increases or "bursts" in the frequency-of-use of character strings over time. This algorithm identifies topics, terms, or concepts important to the events being studied that increased in usage, were more active for a period of time, and then faded away.
An analysis of publications authored or co-authored by Alessandro Vespignani from 1990 to 2006 will be used to illustrate the "burst" concept. Alessandro Vespignani is an Italian physicist and Professor of Informatics and Cognitive Science at Indiana University, Bloomington. In his publications, it is possible to see a change in research focus - from Physics to Complex Networks - beginning in 2001.
Load Alessandro Vespignani's ISI publication history using 'File > Load' and following this path: 'yoursci2directory/sampledata/scientometrics/isi/AlessandroVespignani.isi' (if the file is not in the sample data directory it can be downloaded from 2.5 Sample Datasets).
New ISI File Format
Web of Science changed its output format in September 2011. Older versions of the Sci2 tool (older than v0.5.2 alpha) may refuse to load these new files, with an error like "Invalid ISI format file selected."
If you are using an older version of the Sci2 tool, you can download the WOS-plugins.zip file and unzip the JAR files into your sci2/plugins/ directory. Restart Sci2 to activate the fixes. You can then load the downloaded ISI files into Sci2 without any additional step. If you are using an older version of Sci2 without these plugins, you will need to follow the guidelines below before you can load a file in the new WOS format into the tool.
You can fix this problem for individual files by opening them in Notepad (or your favorite text editor). The file will start with the words:
Just add the word ISI.
And then Save the file.
The ISI file should now load properly. More information on the ISI file format is available here (http://wiki.cns.iu.edu/display/CISHELL/ISI+%28*.isi%29).
This analysis will detect the "bursty" terms used in the title of papers in the dataset. Since the burst detection algorithm is case-sensitive, it is necessary to normalize the field to be analyzed before running the algorithm. Select the table "101 Unique ISI Records" and run 'Preprocessing > Topical > Lowercase, Tokenize, Stem, and Stopword Text.' Check the "Title" box to indicate that you want to normalize this field:
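To make the effect of this preprocessing step concrete, the following is a minimal Python sketch of lowercasing, tokenizing, and stopword removal. It is not the Sci2 implementation: the stopword list, the "|" separator, and the function name are assumptions chosen for illustration, and stemming is omitted entirely.

```python
import re

# Hypothetical, deliberately tiny stopword list; Sci2 ships its own.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def normalize_title(title, separator="|"):
    """Lowercase a title, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return separator.join(kept)


print(normalize_title("Epidemic Spreading in Scale-Free Networks"))
# prints: epidemic|spreading|scale|free|networks
```

Because every title is reduced to the same lowercase form, "Network" and "network" are counted as the same term when bursts are computed.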
Select the resulting "with normalized Title" table in the Data Manager and run 'Analysis > Topical > Burst Detection' with the following parameters:
State transition costs are proportional to the "Gamma" parameter, which controls how easily the automaton can change states. The higher the "Gamma" value, the fewer bursts are generated.
The "Density Scaling" parameter determines how much 'more bursty' each level is beyond the previous one. The higher the scaling value, the more intense the activity at each successive level.
The "Bursting States" parameter determines how many bursting states there will be, beyond the non-bursting state. A value of i bursting states corresponds to i + 1 automaton states.
The "Date Column" parameter is the name of the column containing the date/time at which each event or topic occurs.
The "Date Format" specifies how the date column will be interpreted as a date/time. See http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html for details.
The "Text Column" parameter is the name of the column containing the delimited tokens to be analyzed for bursts.
The "Text Separator" parameter specifies the separator that was used to delimit the tokens in the text column.
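The roles of "Gamma" and "Density Scaling" can be illustrated with a small Python sketch of a two-state automaton (one non-bursting state, one bursting state) in the style of Kleinberg's burst model. This is a simplified illustration and not the Sci2 code: the function name, the binomial fit cost, the transition cost of gamma * ln(n), and the input counts are all assumptions made for this example.

```python
import math


def detect_bursts(relevant, totals, scaling=2.0, gamma=1.0):
    """Return one state per period: 0 = non-bursting, 1 = bursting.

    relevant[t] is the number of records containing the term in period t,
    totals[t] is the number of all records in period t.
    """
    n = len(relevant)
    if n == 0 or sum(relevant) == 0:
        return [0] * n

    def clamp(p):
        return min(max(p, 1e-9), 1 - 1e-9)

    base_rate = clamp(sum(relevant) / sum(totals))
    rates = [base_rate, clamp(base_rate * scaling)]  # expected rate per state
    up_cost = gamma * math.log(n)  # assumed cost of entering the burst state

    def fit_cost(p, r, d):
        # Negative log-likelihood of r hits in d records at rate p
        # (the binomial coefficient is identical for both states, so it is dropped).
        return -(r * math.log(p) + (d - r) * math.log(1 - p))

    cost = [0.0, math.inf]  # the automaton starts in the non-bursting state
    back = []
    for r, d in zip(relevant, totals):
        new_cost, pointers = [], []
        for state in (0, 1):
            options = [cost[prev] + (up_cost if state > prev else 0.0)
                       for prev in (0, 1)]
            best_prev = 0 if options[0] <= options[1] else 1
            new_cost.append(options[best_prev] + fit_cost(rates[state], r, d))
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest state sequence backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


counts = [0, 0, 0, 5, 6, 5, 0, 0]  # made-up yearly counts for one term
totals = [10] * 8                  # made-up yearly totals
print(detect_bursts(counts, totals, gamma=1.0))  # [0, 0, 0, 1, 1, 1, 0, 0]
print(detect_bursts(counts, totals, gamma=4.0))  # [0, 0, 0, 0, 0, 0, 0, 0]
```

In this sketch, raising gamma makes it more expensive to enter the bursting state, so the same spike is no longer reported as a burst. Raising the scaling value raises the rate a term must approach before the bursting state explains the data better than the non-bursting one.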
View the file "Burst detection analysis (Publication Year, Title): maximum burst level 1". On a PC running Windows, right click on this table and select view to see the data in Excel. On a Mac or a Linux system, right click and save the file, then open using the spreadsheet program of your choice.
In this table, there are six columns: "Word," "Level," "Weight," "Length," "Start," and "End."
The "Word" field identifies the specific character string which was detected as a "burst."
The "Level" field is the burst level of this burst. The higher the burst level, the more frequently the event or topic occurs.
The "Weight" field is the weight of this burst over its "Length". A higher weight can result from a longer "Length", a higher "Level", or both.
The "Length" field indicates how long the burst lasted (over the selected time parameter). It is computed as (End - Start + 1).
The "Start" field identifies when the burst began (again, according to the specified time parameter).
The "End" field indicates when the burst stopped. An empty value in the "End" field indicates that the burst lasted until the last date present in the dataset. Where the "End" field is empty, manually add the last year present in the dataset; in this case, 2006.
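For larger result tables, the empty "End" values can also be filled with a short script instead of by hand. The following Python sketch assumes the six-column layout described above; the function name and the two sample rows are made up for illustration.

```python
import csv
import io


def fill_missing_end(rows, last_year):
    """Return copies of the rows with empty "End" values set to last_year."""
    filled = []
    for row in rows:
        row = dict(row)
        if not (row.get("End") or "").strip():
            row["End"] = str(last_year)
        filled.append(row)
    return filled


# Two made-up rows standing in for the saved burst detection table.
sample = io.StringIO(
    "Word,Level,Weight,Length,Start,End\n"
    "fractal,1,4.2,4,1990,1993\n"
    "network,1,9.7,6,2001,\n"
)
for row in fill_missing_end(csv.DictReader(sample), last_year=2006):
    print(row["Word"], row["Start"], row["End"])
```

To process a real file, the `io.StringIO` object would be replaced by an opened .csv file and the result written back with `csv.DictWriter`.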
After you manually add this information, save this .csv file somewhere on your computer. Reload the .csv file into Sci2 using 'File > Load'. Select 'Standard csv format' in the pop-up window. A new table will appear in the Data Manager. To visualize the table that contains the results of the Burst Detection algorithm, select the table you just loaded in the Data Manager and run 'Visualization > Temporal > Temporal Bar Graph' with the following parameters:
Temporal bar graphs are used to visualize numeric data over time, generating labeled horizontal bars. A PostScript file containing the horizontal bar graph will appear in the Data Manager.
Open and view the file using the workflow from Section 2.4 Saving Visualizations for Publication.
The resulting analysis indicates a change in the research focus of Alessandro Vespignani for publications beginning in 2001. For example, the bursting terms "fractal," "growth," "transform," and "fix" starting in 1990 are related to Vespignani's Ph.D., entitled "Fractal Growth and Self-Organized Criticality" in Physics. Other bursts also related to Physics follow these, such as "sandpil." After 2001, bursting terms like "complex," "network," "free," and "weight" appear, signifying a change in Vespignani's research area from Physics to Complex Networks, with a larger number of publications on topics like "weighted networks" and "scale-free networks."
Now, let's run the Burst Detection algorithm again on the same dataset but with a different value for the 'Gamma' parameter. Select the table 'with normalized Title' in the Data Manager and run 'Analysis > Topical > Burst Detection' with the following parameters:
Notice that the value for the gamma parameter is now set to 0.5. The parameter gamma controls the ease with which the automaton can change states. With a smaller gamma value, more bursts will be generated. Running the algorithm with these parameters will generate a new table named "Burst detection analysis (Publication Year, Title): maximum burst level 1.2" in the Data Manager.
Again, where the "End" field is empty, manually add the last year present in the dataset; in this case, 2006.
After you manually add this information, save this .csv file somewhere on your computer. Reload this .csv file into Sci2 using 'File > Load'. Select 'Standard csv format' in the pop-up window. A new table will appear in the Data Manager. To visualize the table that contains the new results of the Burst Detection algorithm, select the table you just loaded in the Data Manager and run 'Visualization > Temporal > Horizontal Bar Graph (not included version)' with the same parameters.
A new PostScript file containing the horizontal bar graph will appear in the Data Manager. Once more, open and view the file using the workflow from Section 2.4 Saving Visualizations for Publication.
As expected, a larger number of bursts appear, and the new bursts have a smaller weight than those depicted in the first graph. These smaller, more numerous bursting terms permit a more detailed view of the dataset and allow the identification of trends. The "protein" burst starting in 2003, for example, indicates the year in which Alessandro Vespignani started to work with "protein-protein interaction networks," while the burst "epidem" - also from 2001 - is related to the application of complex networks to the analysis of epidemic phenomena in biological networks.
The original dataset for Alessandro Vespignani was created in 2006. If you wish to update the dataset to understand how his research has changed and evolved since 2006, you can obtain a new dataset from Web of Science (see 22.214.171.124 ISI Web of Science). However, another way to obtain an individual researcher's publication information is to use their Google Scholar profile, if they have one. One of the biggest benefits of using a Google Scholar profile is that you will get publications not indexed in Web of Science, such as some book chapters. In this example, we will obtain the publication information for Alessandro Vespignani using Google Scholar:
Open Google Scholar in a web browser and search for "Alessandro Vespignani":
If the author or investigator you have searched for has a Google Scholar profile, you will see a link to their profile at the top of the results page:
Keep in mind that not every author you search will necessarily have a Google Scholar profile, but for those that do, this is a very useful way to get their publication information. Click on the link to view Alessandro Vespignani's profile, and then select all publications and click the export button at the top of his publication list to export the citation information:
The easiest way to import the citation data into Sci2 is to export the data as a CSV file:
After you have specified the export format you can save the CSV file to your desired location by clicking the "Export all articles by Alessandro Vespignani" button. Save the file to your desktop and then load it into Sci2 in the standard CSV format:
Once the data is in Sci2, you will need to normalize the text for the titles before you can run Burst Detection. Run 'Preprocessing > Topical > Lowercase, Tokenize, Stem, and Stopword Text' and select the title parameter:
After you normalize the text for the title field you will notice a "with normalized Title" file in the data manager. You will likely need to edit this file before you can run Burst Detection. Right click on the file in the data manager and select view:
This will open the dataset in Excel (or your preferred spreadsheet editor). You will notice that the Lowercase, Tokenize, Stem, and Stopword Text algorithm has placed brackets around the years. You will need to remove these before you can run the Burst Detection algorithm. In Excel, hit 'Ctrl-F' on the keyboard. This will bring up the Find and Replace tool. Highlight the column of years and then perform a find and replace:
You will have to repeat this for the other bracket symbol, which removes the brackets around the years. Next, you will need to remove those publications for which there is no year information. Burst Detection will not run if there are empty values in the date column. You can search for the publications and find the proper date, but the year value could be empty because these are forthcoming publications. In this example, we will just remove all publications without a value in the year column:
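The two clean-up steps above (removing the brackets and dropping records without a year) can also be scripted. The following Python sketch assumes the brackets are square brackets and that the date column is named "Year"; the function name and the sample records are made up for illustration.

```python
def clean_year_rows(rows, year_column="Year"):
    """Strip square brackets from the year and drop rows with no year."""
    cleaned = []
    for row in rows:
        raw = row.get(year_column) or ""
        year = raw.replace("[", "").replace("]", "").strip()
        if year:
            cleaned.append({**row, year_column: year})
    return cleaned


# Made-up records: one with a bracketed year, one without any year.
records = [
    {"Title": "epidem|spread", "Year": "[2001]"},
    {"Title": "human|mobil", "Year": ""},
]
print(clean_year_rows(records))
# prints: [{'Title': 'epidem|spread', 'Year': '2001'}]
```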
You will need to save this file to your desktop and re-load it into Sci2. Then, select the file you have just loaded and run 'Analysis > Topical > Burst Detection' and enter the following parameters:
This will result in a "Burst detection analysis (Year, Title): maximum burst level 1" file in the data manager. Right click on this file to view the data:
You will need to edit the data before you can run the Temporal Bar Graph algorithm to visualize the results of the burst detection. First, you should make sure every record has an "End" date or the Temporal Bar Graph will not run properly. We know that this dataset contains records that are labeled with the year of 2013, so that will be our end date for those bursts that are still continuing:
Before you visualize the results with the Temporal Bar Graph, it is important to know that if you size bars based on weight, the weight value will be distributed across the length of the burst. In other words, the total area of the bar corresponds to the weight value. This means a bar with a high weight value can appear thinner than a bar with a lower weight value if the former burst occurs over a longer period than the latter. Finally, before you visualize this dataset, you can add some categories to allow you to color your bars. For example, you can sort the records from largest to smallest based on the "total weight" column and assign strong, medium, and weak categories to these records based on the "total weight" values:
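The relationship between weight, length, and bar thickness can be sketched in a few lines of Python. This is an illustration of the area rule described above, not the code used by the Temporal Bar Graph. Splitting the ranked records into three equally sized groups is an assumption, since the cut-off values for strong, medium, and weak are a matter of choice.

```python
def bar_thickness(weight, start, end):
    """Thickness of a bar whose area (thickness x length) equals the weight."""
    length = end - start + 1
    return weight / length


def categorize(weights, labels=("strong", "medium", "weak")):
    """Rank weights from largest to smallest and split the ranking into equal groups."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    group_size = -(-len(weights) // len(labels))  # ceiling division
    categories = [None] * len(weights)
    for rank, index in enumerate(order):
        categories[index] = labels[rank // group_size]
    return categories


# A long burst with a high weight is thinner than a short burst with a lower weight.
print(bar_thickness(12.0, 2001, 2006))  # 2.0
print(bar_thickness(8.0, 1990, 1991))   # 4.0
print(categorize([1.0, 9.0, 5.0]))      # ['weak', 'strong', 'medium']
```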
Now, save the file to your desktop and reload it into Sci2 in the standard CSV format and run 'Visualization > Temporal > Temporal Bar Graph', entering the following parameters:
Note that if you select the "Simplified Layout" option, no legend will be created for the visualization. This allows you to create your own legend that accurately reflects the new weight values. To learn how to create a legend for your visualization, see 2.4 Saving Visualizations for Publication.
To view the visualization, save the file from the data manager by right-clicking and selecting save:
Make sure to save the visualization as a PostScript file:
Save the PostScript file to your desktop. If you have a version of the Adobe Creative Suite on your machine, you can simply double-click the PostScript file to launch Adobe Distiller and automatically convert the PostScript file into a PDF for viewing. If you do not have a copy of the Adobe Creative Suite installed, you can use an online version of Ghostscript to convert PostScript files to PDF files: http://ps2pdf.com/. The resulting visualization should look similar to the following:
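If Ghostscript is installed locally, the same conversion can be run without the online service by calling its ps2pdf command. The Python sketch below assumes that ps2pdf is available on the PATH; the file name and function names are hypothetical.

```python
import subprocess
from pathlib import Path


def ps2pdf_command(ps_path):
    """Build the ps2pdf command line that writes a PDF next to the input file."""
    source = Path(ps_path)
    return ["ps2pdf", str(source), str(source.with_suffix(".pdf"))]


def convert(ps_path):
    """Run the conversion; requires Ghostscript's ps2pdf on the PATH."""
    subprocess.run(ps2pdf_command(ps_path), check=True)


print(ps2pdf_command("burst_graph.ps"))
# prints: ['ps2pdf', 'burst_graph.ps', 'burst_graph.pdf']
```

Calling `convert("burst_graph.ps")` would then write burst_graph.pdf into the same directory.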
Remember that the weight for the bars is equal to the total area, not simply the thickness, so including the color categories will help users make more sense of the visualization. You will notice that this burst analysis for Alessandro Vespignani's publications looks similar to the one created in the previous section. However, this new burst analysis takes into consideration his more recent publications and interests in human mobility networks and epidemiology. This workflow can easily be repeated for any author who has a profile in Google Scholar. Give it a try for yourself!
126.96.36.199 Visualizing Burst Detection in Excel
It is possible to generate a visualization for burst analysis in MS Excel. To do so, open the results of the first burst analysis conducted ('Burst detection analysis (Publication Year, Title): maximum burst level 1') in MS Excel by right clicking on this table in the Data Manager and selecting View.
To generate a visual depiction of the bursts in MS Excel perform the following steps:
1. Sort the data ascending by burst start year.
2. Add column headers for all years: first enter the start year, here 1990, in cell G1. Then continue across the row, e.g., using the formula '=G1+1', until the highest burst end year, here 2006 in cell W1. As stated before, an empty "End" field indicates that the burst lasted until the last date present in the dataset.
3. In the resulting word by burst year matrix, select the upper left blank cell (G2) and select 'Conditional Formatting' from the ribbon. Then select 'Data Bars > More Rules > Use a formula to determine which cells to format.' To color cells for years with a burst weight of 10 or more red, and cells with a higher value dark red, use the following formulas and format patterns:
Select 'OK' and then repeat step three, using the formula below:
4. Once both formatting rules have been established, select 'Conditional Formatting > Manage Rules', highlight the first formatting rule and move to the top of the list:
5. Make sure both formatting rules are selected and apply them to the current selection. Apply the format to all cells in the word by year matrix by dragging the box around cell G2 to highlight all cells in the matrix. The result for the given example is shown in Figure 5.32.1.
Figure 5.32.1: Visualizing burst results in MS Excel
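The word by year matrix built in the steps above can also be sketched in Python, which may help to check the Excel result. The function name and the two bursts are made up for illustration, and the sketch assumes at most one burst per word.

```python
def burst_matrix(bursts, first_year, last_year):
    """Map each word to one cell per year: the burst weight inside the burst, else 0."""
    years = range(first_year, last_year + 1)
    matrix = {}
    # Sort by start year, as in step 1 above.
    for word, weight, start, end in sorted(bursts, key=lambda burst: burst[2]):
        matrix[word] = [weight if start <= year <= end else 0 for year in years]
    return matrix


# Made-up (word, weight, start, end) values for two bursts.
bursts = [("network", 9.7, 2001, 2006), ("fractal", 4.2, 1990, 1993)]
for word, cells in burst_matrix(bursts, 1990, 2006).items():
    # Print one row per word, marking burst years with "#".
    print(f"{word:10s}", "".join("#" if cell else "." for cell in cells))
```

Each row corresponds to one word and each column to one year from 1990 to 2006, mirroring columns G through W in the spreadsheet.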