5.1.4 Studying Four Major NetSci Researchers (ISI Data)




Time frame:




Topical Area(s):

Network Science

Analysis Type(s):

Paper Citation Network, Co-Author Network, Bibliographic Coupling Network, Document Co-Citation Network, Word Co-Occurrence Network


Load the file using 'File > Load' and following this path: 'yoursci2directory/sampledata/scientometrics/isi/FourNetSciResearchers.isi' (if the file is not in the sample data directory it can be downloaded from 2.5 Sample Datasets).  A table of all records and a table of 361 records with unique ISI ids will appear in the Data Manager. In this "clean" file, each original record now has a "Cite Me As" attribute that is constructed from the first author, publication year (PY), journal abbreviation (J9), volume (VL), and beginning page (BP) fields of its ISI record. This "Cite Me As" attribute will be used when matching paper and reference records.
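As a rough illustration of how such a key could be assembled from the fields named above, here is a minimal sketch. The exact layout Sci2 uses for "Cite Me As" is not shown in this section; the comma-separated form with `V` and `P` prefixes (modeled on ISI cited-reference strings), the `cite_me_as` helper, and the sample record are all assumptions.

```python
def cite_me_as(record):
    """Build a citation key from ISI fields (hypothetical layout, not Sci2's code)."""
    # AU holds the author list; only the first author is used.
    first_author = record["AU"][0].replace(",", "").upper()
    parts = [first_author, record["PY"], record["J9"]]
    # Volume and beginning page are optional in many records.
    if record.get("VL"):
        parts.append("V" + record["VL"])
    if record.get("BP"):
        parts.append("P" + record["BP"])
    return ", ".join(parts)


# Hypothetical sample record keyed by ISI field tags.
example = {
    "AU": ["Barabasi, AL", "Albert, R"],
    "PY": "1999",
    "J9": "SCIENCE",
    "VL": "286",
    "BP": "509",
}
key = cite_me_as(example)
print(key)
```

Because references in an ISI file are stored as similar strings, a key of this kind lets paper records and reference records be compared directly.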

New ISI File Format

Web of Science changed its output format in September 2011. Versions of the Sci2 Tool older than v0.5.2 alpha may refuse to load these new files, with an error like "Invalid ISI format file selected."

Sci2 solution

If you are using an older version of the Sci2 Tool, you can download the file and unzip the JAR files into your sci2/plugins/ directory. Restart Sci2 to activate the fixes. You can then load the downloaded ISI files into Sci2 without any additional steps. If you are using the old Sci2 Tool without these fixes, you will need to follow the guidelines below before you can load a file in the new WOS format.

You can fix this problem for individual files by opening them in Notepad (or your favorite text editor). The file will start with the words:

Original ISI file:

Just add the word ISI.

Updated ISI file:

And then Save the file.

The ISI file should now load properly. More information on the ISI file format (*.isi) is available in the Sci2 documentation.


Aggregate Function File

Make sure to use the aggregate function file indicated in the image below. Aggregate function files can be found in sci2/sampledata/scientometrics/properties.

The result is a directed network of paper citations in the Data Manager. Each paper node has two citation counts. The local citation count (LCC) indicates how often a paper was cited by papers in the set. The global citation count (GCC) equals the times cited (TC) value in the original ISI file. Only references from other ISI records count towards an ISI paper's GCC value. Currently, the Sci2 Tool sets the GCC of references that are not also ISI records to -1, so that the network can be pruned to contain only the original ISI records.
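The two counts can be illustrated with a small sketch. The edge list, the `times_cited` table, and the `citation_counts` helper are hypothetical toy data and names, not the tool's implementation; the only conventions taken from the text are that LCC counts citations from within the set and that nodes without an ISI record receive a GCC of -1.

```python
from collections import Counter

# Hypothetical toy data: (citing, cited) pairs and TC values for ISI records only.
citations = [("A", "B"), ("A", "R1"), ("C", "B"), ("C", "A"), ("B", "R1")]
times_cited = {"A": 120, "B": 45, "C": 3}


def citation_counts(citations, times_cited):
    """Return {node: {"lcc": ..., "gcc": ...}} for every node in the edge list."""
    # LCC: how often a node is cited by papers inside the data set.
    lcc = Counter(cited for _citing, cited in citations)
    nodes = {node for edge in citations for node in edge}
    # GCC: the TC value if the node is an ISI record, otherwise -1.
    return {
        node: {"lcc": lcc.get(node, 0), "gcc": times_cited.get(node, -1)}
        for node in sorted(nodes)
    }


counts = citation_counts(citations, times_cited)
for node, values in counts.items():
    print(node, values)
```

Here `R1` is cited twice inside the set but has no ISI record of its own, so it ends up with a GCC of -1 and would be dropped when the network is pruned.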


  1. Resize Linear > Nodes > globalcitationcount > From: 1 To: 50 > When the nodes have no 'globalcitationcount': 0.1 > Do Resize Linear
  2. Colorize > Nodes > globalcitationcount > From:   To:   >  (When the nodes have no 'globalcitationcount': 0.1 >   >Do Colorize)
  3. Colorize > Edges > weight > From (select the "RGB" tab) 127, 193, 65 To: (select the "RGB" tab) 0, 0, 0
  4. Type in Interpreter:


The complete network can be reduced to papers that appeared in the original ISI file by deleting all nodes that have a GCC of -1. Simply run 'Preprocessing > Networks > Extract Nodes Above or Below Value' with parameter values:


The resulting network is unconnected, i.e., it has many subnetworks many of which have only one node. These single unconnected nodes, also called isolates, can be removed using 'Preprocessing > Networks > Delete Isolates'. Deleting isolates is a memory intensive procedure. If you experience problems at this step, refer to Section 3.4 Memory Allocation.
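The two pruning steps can be sketched on a toy network as follows. The parameter values used in the tool are not reproduced in this section, so the threshold "GCC greater than -1" is an assumption, as are the helper names and the sample data.

```python
def extract_nodes_above(nodes, edges, attribute, threshold):
    """Keep nodes whose attribute exceeds the threshold, plus edges among them."""
    kept = {n: attrs for n, attrs in nodes.items() if attrs[attribute] > threshold}
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges


def delete_isolates(nodes, edges):
    """Drop nodes that are not an endpoint of any edge."""
    connected = {n for edge in edges for n in edge}
    return {n: attrs for n, attrs in nodes.items() if n in connected}, edges


# Hypothetical toy network: R1 is a reference without an ISI record (GCC of -1).
nodes = {
    "A": {"gcc": 120},
    "B": {"gcc": 45},
    "C": {"gcc": 3},
    "D": {"gcc": 0},
    "R1": {"gcc": -1},
}
edges = [("A", "B"), ("A", "R1"), ("C", "B"), ("D", "R1")]

papers, paper_edges = extract_nodes_above(nodes, edges, "gcc", -1)
connected, _ = delete_isolates(papers, paper_edges)
print(sorted(papers), paper_edges, sorted(connected))
```

Paper `D` survives the first step but only cited `R1`, so it becomes an isolate once `R1` is removed and is dropped in the second step.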


The largest component has 2407 nodes; the second largest, 307; the third, 13; and the fourth has 7 nodes. The largest component is shown in Figure 5.12. The top 20 papers, by times cited in ISI, have been labeled using:


Compare the result with Figure 5.11 and note that this network layout algorithm, like most others, is non-deterministic: different runs lead to different layouts. That said, all layouts aim to group connected nodes into spatial proximity while avoiding overlaps of unconnected or sparsely connected subnetworks.

To see the log file from this workflow save the Paper-Paper (Citation) Network log file.

Author Co-Occurrence (Co-Author) Network

To produce a co-authorship network in the Sci2 Tool, select the table of all 361 unique ISI records from the 'FourNetSciResearchers' dataset in the Data Manager window. Run 'Data Preparation > Extract Co-Author Network' using the parameter:


Table 5.2: Merging of author nodes using the merge table

A merge table can be automatically generated by applying the Jaro distance metric (Jaro, 1989, 1995), available in the open source Similarity Measure Library, to identify potential duplicates. In the Sci2 Tool, simply select the co-author network and run 'Data Preparation > Detect Duplicate Nodes' using the parameters:
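Independently of the tool's parameter settings, the Jaro metric itself can be illustrated with a short sketch. This is a plain implementation of the standard Jaro similarity, not the Similarity Measure Library code; the `candidate_duplicates` helper, the 0.9 threshold, and the sample names are assumptions.

```python
from itertools import combinations


def jaro(s1, s2):
    """Standard Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def candidate_duplicates(names, threshold=0.9):
    """Return name pairs whose case-insensitive Jaro similarity meets the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = jaro(a.lower(), b.lower())
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs


# Hypothetical author labels.
pairs = candidate_duplicates(["Barabasi, AL", "Barabasi, A", "Vespignani, A"])
print(pairs)
```

Pairs found this way are only candidates: a high similarity score does not prove that two labels refer to the same person, so the resulting merge table should still be reviewed by hand.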


Code Block
> # Size and color author nodes by number of works
> resizeLinear(numberofworks,1,50)
> colorize(numberofworks,gray,black)
> for n in g.nodes:
      n.strokecolor = n.color
> # Size and color edges by number of co-authored works
> resizeLinear(numberofcoauthoredworks, .25, 8)
> colorize(numberofcoauthoredworks, "127,193,65,255", black)
> # Label the 50 authors with the most works
> nodesbynumworks = g.nodes[:]
> def bynumworks(n1, n2):
      return cmp(n1.numberofworks, n2.numberofworks)
> nodesbynumworks.sort(bynumworks)
> nodesbynumworks.reverse()
> for i in range(0, 50):
      nodesbynumworks[i].labelvisible = true


If the network being processed is undirected, as is the case here, MST-Pathfinder Network Scaling can be used to prune the network. This produces results about 30 times faster than Fast Pathfinder Network Scaling. We have also found that networks with a low standard deviation of edge weights, or in which many edge weights equal the minimum edge weight, might not be scaled as much as expected when using Fast Pathfinder Network Scaling. To see this behavior, run 'Preprocessing > Networks > MST-Pathfinder Network Scaling' with the network named 'Updated network' selected with the following parameters:
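Separately from the parameter settings, the idea behind MST-based scaling can be sketched in a few lines. The sketch below keeps every edge that belongs to at least one minimum spanning tree, which is the usual description of Pathfinder scaling with r = infinity and q = n - 1. It is not the Sci2 implementation; it assumes weights are dissimilarities (smaller means closer), and the `union_of_msts` name and the sample edges are hypothetical.

```python
from itertools import groupby


def union_of_msts(edges):
    """Return the edges (u, v, weight) that belong to at least one minimum spanning tree."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    by_weight = sorted(edges, key=lambda e: e[2])
    for _weight, group in groupby(by_weight, key=lambda e: e[2]):
        # Test all edges of equal weight against the components formed
        # by strictly lighter edges, before merging any of them.
        candidates = [e for e in group if find(e[0]) != find(e[1])]
        kept.extend(candidates)
        for u, v, _w in candidates:
            parent[find(u)] = find(v)
    return kept


# The heavier edge a-c closes a cycle of lighter edges and is pruned.
pruned = union_of_msts([("a", "b", 1), ("b", "c", 1), ("a", "c", 2)])
# With all weights equal, no edge can be pruned.
unpruned = union_of_msts([("a", "b", 1), ("b", "c", 1), ("a", "c", 1)])
print(pruned, unpruned)
```

The second call shows why networks with many equal edge weights shrink less than expected: ties keep every tied edge, since each of them lies on some minimum spanning tree.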


This second type of output file is particularly suitable to study skewed distributions: the fact that the size of the bins grows large for large degree values compensates for the fact that not many nodes have high degree values, so it suppresses the fluctuations that one would observe by using bins of equal size. On a double logarithmic scale, which is very useful to determine the possible power law behavior of the distribution, the points of the latter will appear equally spaced on the x-axis.
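A minimal sketch of this kind of binning is shown below. The binning scheme used by the tool is not specified here, so the choice of bins `[2**k, 2**(k+1))`, the normalization by bin width, and the function name are assumptions.

```python
from collections import Counter


def log_binned_degree_distribution(degrees, base=2):
    """Return (bin_start, bin_end, density) tuples for bins [base**k, base**(k+1))."""
    positive = [d for d in degrees if d > 0]
    total = len(positive)
    counts = Counter()
    for d in positive:
        # Find k such that base**k <= d < base**(k+1).
        k = 0
        while base ** (k + 1) <= d:
            k += 1
        counts[k] += 1
    result = []
    for k in sorted(counts):
        start, end = base ** k, base ** (k + 1)
        # Dividing by the bin width compensates for the growing bin size.
        result.append((start, end, counts[k] / (total * (end - start))))
    return result


# Hypothetical degree sequence with a few high-degree nodes.
degrees = [1, 1, 1, 1, 2, 2, 3, 4, 5, 9]
bins = log_binned_degree_distribution(degrees)
for start, end, density in bins:
    print(start, end, density)
```

Because each bin starts at a power of the base, the bin positions are evenly spaced on a logarithmic x-axis, which is what makes a power law show up as a straight line on a double logarithmic plot.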

Also visualize this second output file with 'Visualization > General > GnuPlot':

Community Detection

Community detection algorithms look for subgraphs where nodes are highly interconnected among themselves and poorly connected with nodes outside the subgraph. Many community detection algorithms are based on the optimization of modularity, a scalar value between -1 and 1 that measures the density of links inside communities as compared to links between communities. The Blondel Community Detection algorithm finds high modularity partitions of large networks in a short time and unfolds a complete hierarchical community structure for the network, thereby giving access to different resolutions of community detection.
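To make the modularity score concrete, the sketch below evaluates it for a given partition of a small unweighted, undirected network. It only computes the score; it is not the Blondel algorithm, which searches for partitions that maximize it. The function name and the toy graph are assumptions, and the edge list is assumed to be non-empty.

```python
def modularity(edges, community):
    """Modularity of a partition: sum over communities of L_c/m - (d_c/(2m))**2."""
    m = len(edges)
    internal = {}    # L_c: number of edges with both endpoints in community c
    degree_sum = {}  # d_c: sum of the degrees of the nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
one_group = {node: "x" for node in range(1, 7)}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
print(q_split, q_single)
```

Splitting the network at the bridge scores higher than putting every node into one community, which matches the intuition that the two triangles form separate groups.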


To view the network with community attributes in GUESS, select the network used to generate the image above and run 'Visualization > Networks > GUESS'.

To see the log file from this workflow save the Author Co-Occurrence (Co-Author) Network log file.

Cited Reference Co-Occurrence (Bibliographic Coupling) Network

In Sci2, a bibliographic coupling network is derived from a directed paper citation network (see section Document-Document (Citation) Network).
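The derivation can be sketched as follows: two papers are coupled when they cite at least one common reference, and the edge weight is the number of shared references. The function name and the toy citation list are assumptions, not the tool's code.

```python
from itertools import combinations


def bibliographic_coupling(citations):
    """Map each pair of citing papers to the number of references they share."""
    references = {}
    for citing, cited in citations:
        references.setdefault(citing, set()).add(cited)
    coupling = {}
    for p, q in combinations(sorted(references), 2):
        shared = len(references[p] & references[q])
        if shared:
            coupling[(p, q)] = shared
    return coupling


# Hypothetical directed citation edges (citing, cited).
citations = [
    ("A", "R1"), ("A", "R2"),
    ("B", "R1"), ("B", "R2"),
    ("C", "R2"), ("C", "R3"),
]
coupling = bibliographic_coupling(citations)
print(coupling)
```

The result is undirected and weighted even though the input is directed, because sharing a reference is a symmetric relation.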


Code Block
> # Size and color paper nodes by global citation count
> resizeLinear(globalcitationcount,2,40)
> colorize(globalcitationcount,gray,black)
> # Size and color edges by weight
> resizeLinear(weight,.25,8)
> colorize(weight, "127,193,65,255", black)
> for n in g.nodes:
      n.strokecolor = n.color
> # Label the 20 most cited papers
> toptc = g.nodes[:]
> def bytc(n1, n2):
      return cmp(n1.globalcitationcount, n2.globalcitationcount)
> toptc.sort(bytc)
> toptc.reverse()
> toptc
> for i in range(0, 20):
      toptc[i].labelvisible = true


For both workflows described above, the final step should be to run 'Layout > GEM' and then 'Layout > Bin Pack' to give a better representation of node clustering.

Figure 5.14: Reference co-occurrence network layout for 'FourNetSciResearchers' dataset 

To see the log file from this workflow save the Cited Reference Co-Occurrence (Bibliographic Coupling) Network log file.

Document Co-Citation Network (DCA)


Figure 5.15: Undirected, weighted bibliographic coupling network (left) and undirected, weighted co-citation network (right) of 'FourNetSciResearchers' dataset, with isolate nodes removed
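Co-citation is the mirror image of bibliographic coupling: two documents are linked when the same paper cites both, and the edge weight counts how many papers do so. A minimal sketch, with a hypothetical function name and toy data:

```python
from collections import Counter
from itertools import combinations


def co_citation(citations):
    """Map each pair of cited documents to the number of papers citing both."""
    references = {}
    for citing, cited in citations:
        references.setdefault(citing, set()).add(cited)
    weights = Counter()
    for cited_set in references.values():
        # Every pair of documents in one reference list is co-cited once.
        for pair in combinations(sorted(cited_set), 2):
            weights[pair] += 1
    return dict(weights)


# Hypothetical directed citation edges (citing, cited).
citations = [
    ("A", "R1"), ("A", "R2"),
    ("B", "R1"), ("B", "R2"),
    ("C", "R2"), ("C", "R3"),
]
cocited = co_citation(citations)
print(cocited)
```

Bibliographic coupling would link the citing papers A, B, and C for the same input, whereas co-citation links the cited documents R1, R2, and R3.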

To see the log file from this workflow save the Document Co-Citation Network (DCA) log file.

Word Co-Occurrence Network



In the Sci2 Tool, select "361 unique ISI Records" from the 'FourNetSciResearchers' dataset in the Data Manager. Run 'Preprocessing > Topical > Lowercase, Tokenize, Stem, and Stopword Text' using the following parameters:


Text normalization utilizes the Standard Analyzer provided by Lucene. It separates text into word tokens, normalizes word tokens to lower case, removes "s" from the end of words, removes dots from acronyms, deletes stop words, and applies the English Snowball stemmer, which is a version of the Porter2 stemmer designed for the English language.
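The steps listed above can be approximated in plain Python to see their combined effect. This is only a rough sketch and not Lucene's Standard Analyzer: the stop word list is a tiny illustrative one, the plural handling is a naive rule, and the Snowball stemming step is omitted entirely because it requires a stemmer library.

```python
import re

# Tiny illustrative stop word list (an assumption, not Lucene's list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def normalize(text):
    """Lowercase, undot acronyms, tokenize, drop stop words, strip a trailing 's'."""
    text = text.lower()
    # Remove dots from acronyms such as "u.s.a." so they become one token.
    text = re.sub(r"\b(?:[a-z]\.){2,}", lambda m: m.group(0).replace(".", ""), text)
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Naive plural handling; a real stemmer does considerably more.
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


tokens = normalize("The U.S.A. networks are small-world networks.")
print(tokens)
```

With a stemmer in place, related forms such as "clustering" and "clustered" would also collapse into one token, which this sketch does not attempt.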

The result is a derived table – "with normalized Abstract" – in which the text in the abstract column is normalized. Select this table and run 'Data Preparation > Extract Word Co-Occurrence Network' using parameters:

Aggregate Function File

If you are working with ISI data, you can use the aggregate function file indicated in the image below. Aggregate function files can be found in sci2/sampledata/scientometrics/properties.


If you are not working with ISI data and wish to create your own aggregate function file, you can find more information in 3.6 Property Files.

The outcome is a network in which nodes represent words and edges denote their joint appearance in a paper. Word co-occurrence networks are rather large and dense. Running 'Analysis > Networks > Network Analysis Toolkit (NAT)' reveals that the network has 2,821 word nodes and 242,385 co-occurrence edges.
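How such a network grows so dense can be seen in a minimal sketch: every pair of distinct words in one abstract becomes an edge, so an abstract with n distinct words contributes n(n-1)/2 edges. The function name and the toy token lists are assumptions.

```python
from collections import Counter
from itertools import combinations


def word_cooccurrence(documents):
    """Map each word pair to the number of documents in which both words appear."""
    weights = Counter()
    for tokens in documents:
        # Use the set of words so repeated words count once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            weights[pair] += 1
    return weights


# Hypothetical normalized abstracts, one token list per paper.
documents = [
    ["network", "scale", "free"],
    ["network", "small", "world"],
    ["scale", "free", "network"],
]
weights = word_cooccurrence(documents)
words = {word for pair in weights for word in pair}
print(len(words), "nodes,", len(weights), "edges")
```

Even three short token lists already yield six edges among five words, which is why real abstracts produce hundreds of thousands of edges.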


Note that only the top 1000 edges (by weight) in this large network appear in the above visualization, creating the impression of isolate nodes. To remove nodes that are not connected by the top 1000 edges (by weight), run 'Preprocessing > Networks > Delete Isolates' on the "top 1000 edges by weight" network and visualize the result using the workflow described above.

To see the log file from this workflow save the Word Co-Occurrence Network log file.

Database Extractions

Extended Version

This workflow uses the extended version of the Sci2 Tool. To learn how to extend Sci2, see Section 3.2 Additional Plugins.



The database plugin is not currently available for the most recent version of Sci2 (v1.0 alpha). However, the plugin that allows files to be loaded as databases is available for Sci2 v0.5.2 alpha or older. Please check the Sci2 news page for updates. We will update this page when a database plugin becomes available for the latest version of the tool.

The Sci2 Tool supports the creation of databases from ISI files. Database loading improves the speed and functionality of data preparation and preprocessing. While the initial loading can take quite some time for larger datasets (see sections 3.4 Memory Allocation and 3.5 Memory Limits) it results in vastly faster and more powerful data processing and extraction.


View the file "Burst detection analysis (Publication Year, Reference): maximum burst level 1". On a PC running Windows, right click on this table and select view to see the data in Excel. On a Mac or a Linux system, right click and save the file, then open using the spreadsheet program of your choice. See Burst Detection for the meaning of each field in the output.

An empty value in the "End" field indicates that the burst lasted until the last date present in the dataset. Where the "End" field is empty, manually add the last year present in the dataset, in this case 2007.
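For larger result tables, the same edit can be scripted instead of done by hand. The sketch below is one possible way to do this outside the tool; only the "End" column name and the year 2007 come from the text, while the other column names, the sample rows, and the helper name are assumptions.

```python
import csv
import io


def fill_open_bursts(rows, last_year, end_field="End"):
    """Return copies of the rows with empty end values replaced by last_year."""
    filled = []
    for row in rows:
        row = dict(row)
        if not (row.get(end_field) or "").strip():
            row[end_field] = str(last_year)
        filled.append(row)
    return filled


# Hypothetical burst table; a real file would be opened with open(path, newline="").
sample = "Word,Start,End\nR1,2001,2004\nR2,2005,\n"
rows = list(csv.DictReader(io.StringIO(sample)))
fixed = fill_open_bursts(rows, 2007)

# Write the corrected table back out as CSV text.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["Word", "Start", "End"], lineterminator="\n")
writer.writeheader()
writer.writerows(fixed)
print(output.getvalue())
```

Bursts that already have an end year are left unchanged; only the open-ended ones receive the last year of the dataset.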

After you manually add this information, save the .csv file somewhere on your computer. Load the .csv file back into Sci2 using 'File > Load'. Select 'Standard csv format' in the pop-up window. A new table will appear in the Data Manager. To visualize this table, which contains the results of the Burst Detection algorithm, select the table you just loaded in the Data Manager and run 'Visualization > Temporal > Horizontal Bar Graph' with the following parameters:


Figure 5.21: Longitudinal study of 'FourNetSciResearchers,' visualized in GUESS

Using Sci2's database functionality allows for several network extractions that cannot be achieved with the text-based algorithms. For example, extracting journal co-citation networks reveals which journals are cited together most frequently. Run 'Data Preparation > Database > ISI > Extract Document Co-Citation Network (Core and References)' on the database to create a network of co-cited journals, and then prune it using 'Preprocessing > Networks > Extract Edges Above or Below Value' with the parameters: