Table of Contents
Paper Citation Network, Co-Author Network, Bibliographic Coupling Network, Document Co-Citation Network, Word Co-Occurrence Network
Load the file using 'File > Load' and following this path: 'yoursci2directory/sampledata/scientometrics/isi/FourNetSciResearchers.isi' (if the file is not in the sample data directory it can be downloaded from 2.5 Sample Datasets). A table of all records and a table of 361 records with unique ISI ids will appear in the Data Manager. In this "clean" file, each original record now has a "Cite Me As" attribute that is constructed from the first author, publication year (PY), journal abbreviation (J9), volume (VL), and beginning page (BP) fields of its ISI record. This "Cite Me As" attribute will be used when matching paper and reference records.
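The exact formatting Sci2 uses internally is best verified against the tool itself, but the construction of the "Cite Me As" key can be sketched as follows (the `cite_me_as` helper and the record layout are hypothetical; the output mimics the common ISI cited-reference style):

```python
def cite_me_as(record):
    """Assemble a 'Cite Me As' key from the first author, publication
    year (PY), journal abbreviation (J9), volume (VL), and beginning
    page (BP) fields of an ISI record (illustrative sketch only)."""
    first_author = record["AU"].split(";")[0].strip()
    return "%s, %s, %s, V%s, P%s" % (
        first_author, record["PY"], record["J9"], record["VL"], record["BP"])

record = {"AU": "Barabasi, AL; Albert, R", "PY": "1999",
          "J9": "SCIENCE", "VL": "286", "BP": "509"}
print(cite_me_as(record))  # Barabasi, AL, 1999, SCIENCE, V286, P509
```

Matching a paper's reference strings against these keys is what links citing and cited records.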
New ISI File Format
Web of Science changed its output format in September 2011. Older versions of the Sci2 tool (older than v0.5.2 alpha) may refuse to load these new files, reporting an error like "Invalid ISI format file selected."
If you are using an older version of the Sci2 tool, you can download the WOS-plugins.zip file and unzip the JAR files into your sci2/plugins/ directory. Restart Sci2 to activate the fixes; the new WOS format files will then load without any additional steps. If you prefer not to install the plugins, follow the guidelines below to edit each file before loading it into the tool.
You can fix this problem for individual files by opening them in Notepad (or your favorite text editor). The file will start with the words:
Original ISI file:
Just add the word ISI.
Updated ISI file:
Then save the file.
The ISI file should now load properly. More information on the ISI file format is available here (http://wiki.cns.iu.edu/display/CISHELL/ISI+%28*.isi%29).
Make sure to use the aggregate function file indicated in the image below. Aggregate function files can be found in sci2/sampledata/scientometrics/properties.
The result is a directed network of paper citations in the Data Manager. Each paper node has two citation counts. The local citation count (LCC) indicates how often a paper was cited by papers in the set. The global citation count (GCC) equals the times cited (TC) value in the original ISI file. Only references from other ISI records count towards an ISI paper's GCC value. Currently, the Sci2 Tool sets the GCC of references that are not also ISI records to -1, so that the network can later be pruned to contain only the original ISI records.
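The LCC is simply a within-set citation tally: count, for each paper, how many edges in the citation network point at it. A minimal sketch (the `local_citation_counts` helper is illustrative, not part of Sci2):

```python
from collections import Counter

def local_citation_counts(citation_edges):
    """Local citation count (LCC): how often each paper is cited by
    other papers *within* the loaded set. citation_edges is a list of
    (citing, cited) pairs from the directed paper citation network."""
    return dict(Counter(cited for _, cited in citation_edges))

edges = [("a", "b"), ("c", "b"), ("a", "c")]
print(local_citation_counts(edges))  # {'b': 2, 'c': 1}
```

The GCC, by contrast, is read directly from the TC field and reflects citations from the entire Web of Science, not just the loaded set.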
- Resize Linear > Nodes > globalcitationcount > From: 1 To: 50 > When the nodes have no 'globalcitationcount': 0.1 > Do Resize Linear
- Colorize > Nodes > globalcitationcount > From: To: > When the nodes have no 'globalcitationcount': 0.1 > Do Colorize
- Colorize > Edges > weight > From (select the "RGB" tab) 127, 193, 65 To: (select the "RGB" tab) 0, 0, 0
- Type in Interpreter:
>for n in g.nodes: n.strokecolor = n.color
Or, select the 'Interpreter' tab at the bottom, left-hand corner of the GUESS window, and enter the command lines:
> resizeLinear(globalcitationcount,1,50) > colorize(globalcitationcount,gray,black) > for e in g.edges: e.color="127,193,65,255"
The complete network can be reduced to papers that appeared in the original ISI file by deleting all nodes that have a GCC of -1. Simply run 'Preprocessing > Networks > Extract Nodes Above or Below Value' with parameter values:
The resulting network is unconnected, i.e., it has many subnetworks, many of which have only one node. These single unconnected nodes, also called isolates, can be removed using 'Preprocessing > Networks > Delete Isolates'. Deleting isolates is a memory-intensive procedure. If you experience problems at this step, refer to Section 3.4 Memory Allocation.
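Conceptually, deleting isolates just drops every node with no incident edges; a minimal sketch (the `delete_isolates` helper is illustrative, not Sci2's implementation):

```python
def delete_isolates(nodes, edges):
    """Remove nodes that have no incident edges (isolates)."""
    connected = set()
    for u, v in edges:
        connected.add(u)
        connected.add(v)
    return [n for n in nodes if n in connected]

print(delete_isolates(["a", "b", "c"], [("a", "b")]))  # ['a', 'b']
```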
The 'FourNetSciResearchers' dataset has exactly 65 isolates. Removing those leaves 12 networks shown in Figure 5.11 (right) using the same color and size coding as in Figure 5.11 (left). Using 'View > Information Window' in GUESS reveals detailed information for any node or edge.
Alternatively, nodes could have been color and/or size coded by their degree using, e.g.:
> g.computeDegrees() > colorize(outdegree,gray,black)
The largest component has 2407 nodes; the second largest, 307; the third, 13; and the fourth has 7 nodes. The largest component is shown in Figure 5.12. The top 20 papers, by times cited in ISI, have been labeled using
> toptc = g.nodes[:] > def bytc(n1, n2): return cmp(n1.globalcitationcount, n2.globalcitationcount) > toptc.sort(bytc) > toptc.reverse() > toptc > for i in range(0, 20): toptc[i].labelvisible = true
Compare the result with Figure 5.11 and note that this network layout algorithm, like most others, is non-deterministic: different runs lead to different layouts. That said, all layouts aim to group connected nodes into spatial proximity while avoiding overlaps of unconnected or sparsely connected subnetworks.
To see the log file from this workflow, save the Document-Document (Citation) Network log file.
To produce a co-authorship network in the Sci2 Tool, select the table of all 361 unique ISI records from the 'FourNetSciResearchers' dataset in the Data Manager window. Run 'Data Preparation > Extract Co-Author Network' using the parameter:
Table 5.2: Merging of author nodes using the merge table
A merge table can be automatically generated by applying the Jaro distance metric (Jaro, 1989, 1995), available in the open source Similarity Measure Library (http://sourceforge.net/projects/simmetrics/), to identify potential duplicates. In the Sci2 Tool, simply select the co-author network and run 'Data Preparation > Detect Duplicate Nodes' using the parameters:
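The Jaro metric scores two strings between 0.0 (no matching characters) and 1.0 (identical), rewarding matched characters that appear in the same order. A self-contained sketch of the standard formula (illustrative; the SimMetrics library is what Sci2 actually uses):

```python
def jaro(s1, s2):
    """Jaro similarity: matched characters must be equal and no more
    than floor(max(len1, len2)/2) - 1 positions apart."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    m = matches
    return (m / len1 + m / len2 + (m - t) / m) / 3

print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```

Author-name pairs whose similarity exceeds the chosen threshold become candidate rows in the merge table, which should still be reviewed by hand.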
The updated co-authorship network can be visualized using 'Visualization > Networks > GUESS' (see the GUESS Visualizations section for more information regarding GUESS).
Figure 5.13 shows the layout of the combined 'FourNetSciResearchers' dataset after it was modified using the following commands in the "Interpreter":
> resizeLinear(numberofworks,1,50) > colorize(numberofworks,gray,black) > for n in g.nodes: n.strokecolor = n.color > resizeLinear(numberofcoauthoredworks, .25, 8) > colorize(numberofcoauthoredworks, "127,193,65,255", black) > nodesbynumworks = g.nodes[:] > def bynumworks(n1, n2): return cmp(n1.numberofworks, n2.numberofworks) > nodesbynumworks.sort(bynumworks) > nodesbynumworks.reverse() > for i in range(0, 50): nodesbynumworks[i].labelvisible = true
If the network being processed is undirected, which is the case here, then MST-Pathfinder Network Scaling can be used to prune the network. This produces results roughly 30 times faster than Fast Pathfinder Network Scaling. Also, we have found that networks that have a low standard deviation for edge weights, or that have many edge weights equal to the minimum edge weight, might not be scaled as much as expected when using Fast Pathfinder Network Scaling. To see this behavior, run 'Preprocessing > Networks > MST-Pathfinder Network Scaling' with the network named 'Updated network' selected, using the following parameters:
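MST-Pathfinder scaling builds on minimum spanning trees of the network (treating stronger similarity as shorter distance). The spanning-tree backbone can be sketched with Kruskal's algorithm; this is a simplification, since the actual algorithm handles tied weights and retains the union of all minimum spanning trees:

```python
def mst_edges(edges):
    """Kruskal's algorithm: keep the lightest edges that do not form a
    cycle, using union-find with path halving. edges is a list of
    (weight, u, v) triples, where lower weight = shorter distance."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:       # joining two components: no cycle created
            parent[ru] = rv
            kept.append((w, u, v))
    return kept

# The heaviest edge of the triangle is pruned.
print(mst_edges([(1, "a", "b"), (2, "b", "c"), (3, "a", "c")]))
```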
This second type of output file is particularly suitable for studying skewed distributions: the bins grow large for large degree values, which compensates for the fact that few nodes have high degree values and suppresses the fluctuations one would observe with bins of equal size. On a double logarithmic scale, which is very useful for detecting possible power-law behavior of the distribution, the points of the latter appear equally spaced on the x-axis.
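The idea behind this logarithmic binning can be sketched as follows (the `log_binned_distribution` helper is illustrative, not Sci2's implementation): each bin doubles in width, counts are normalized by bin width, and the bin center is the geometric mean of its endpoints, so the points land equally spaced on a log x-axis.

```python
import math
from collections import Counter

def log_binned_distribution(degrees, base=2.0):
    """Bin a degree sequence into exponentially growing bins
    [1,2), [2,4), [4,8), ... and return (bin_center, density) pairs
    suitable for a log-log plot of a heavy-tailed distribution."""
    counts = Counter(degrees)
    kmax = max(counts)
    points = []
    lo = 1.0
    while lo <= kmax:
        hi = lo * base
        n = sum(c for k, c in counts.items() if lo <= k < hi)
        if n:
            # Normalize by bin width; center at the geometric mean.
            points.append((math.sqrt(lo * hi), n / (hi - lo)))
        lo = hi
    return points

print(log_binned_distribution([1, 1, 1, 1, 2, 2, 3, 4, 8]))
```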
Visualize this second output file as well with 'Visualization > General > GnuPlot':
Community detection algorithms look for subgraphs in which nodes are highly interconnected among themselves and poorly connected with nodes outside the subgraph. Many community detection algorithms are based on optimizing modularity, a scalar value between -1 and 1 that measures the density of links inside communities as compared to links between communities. Blondel Community Detection finds high-modularity partitions of large networks in a short time and unfolds a complete hierarchical community structure for the network, thereby giving access to different resolutions of community detection.
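For intuition, the standard Newman modularity for an unweighted, undirected graph can be sketched as the fraction of edges inside communities minus the fraction expected in a random graph with the same degrees (the `modularity` helper is illustrative; Blondel's algorithm optimizes this quantity greedily and hierarchically):

```python
def modularity(edges, community):
    """Newman modularity Q. edges: list of (u, v) pairs;
    community: dict mapping node -> community id."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed fraction of intra-community edges...
    q = sum(1 for u, v in edges if community[u] == community[v]) / m
    # ...minus the fraction expected under the configuration model.
    for comm in set(community.values()):
        deg_sum = sum(d for n, d in degree.items() if community[n] == comm)
        q -= (deg_sum / (2 * m)) ** 2
    return q

# Two triangles joined by a single bridge edge: a clean 2-community split.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
parts = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(round(modularity(edges, parts), 4))  # 0.3571
```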
To view the network with community attributes in GUESS, select the network used to generate the image above and run 'Visualization > Networks > GUESS.'
To see the log file from this workflow, save the Author Co-Occurrence (Co-Author) Network log file.
Cited Reference Co-Occurrence (Bibliographic Coupling) Network
In Sci2, a bibliographic coupling network is derived from a directed paper citation network (see the Document-Document (Citation) Network section).
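The derivation is conceptually simple: two papers are coupled if they cite the same references, and the edge weight is the number of shared references. A minimal sketch (the `bibliographic_coupling` helper is illustrative, not Sci2's implementation):

```python
from itertools import combinations

def bibliographic_coupling(citations):
    """Derive an undirected bibliographic coupling network from a
    directed citation mapping {paper: set of cited references}.
    Edge weight = number of references two papers share."""
    weights = {}
    for a, b in combinations(sorted(citations), 2):
        shared = len(citations[a] & citations[b])
        if shared:
            weights[(a, b)] = shared
    return weights

citations = {"p1": {"r1", "r2", "r3"}, "p2": {"r2", "r3"}, "p3": {"r9"}}
print(bibliographic_coupling(citations))  # {('p1', 'p2'): 2}
```

Co-citation (the next section) is the mirror image: two *references* are linked when the same paper cites both of them.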
This network can be visualized in GUESS; see Figure 5.14. Nodes and edges can be color and size coded, and the top 20 most-cited papers can be labeled by entering the following lines in the GUESS "Interpreter":
> resizeLinear(globalcitationcount,2,40) > colorize(globalcitationcount,gray,black) > resizeLinear(weight,.25,8) > colorize(weight, "127,193,65,255", black) > for n in g.nodes: n.strokecolor=n.color > toptc = g.nodes[:] > def bytc(n1, n2): return cmp(n1.globalcitationcount, n2.globalcitationcount) > toptc.sort(bytc) > toptc.reverse() > toptc > for i in range(0, 20): toptc[i].labelvisible = true
For both workflows described above, the final step should be to run 'Layout > GEM' and then 'Layout > Bin Pack' to give a better representation of node clustering.
Figure 5.14: Reference co-occurrence network layout for 'FourNetSciResearchers' dataset
To see the log file from this workflow, save the Cited Reference Co-Occurrence (Bibliographic Coupling) Network log file.
Document Co-Citation Network (DCA)
Figure 5.15: Undirected, weighted bibliographic coupling network (left) and undirected, weighted co-citation network (right) of 'FourNetSciResearchers' dataset, with isolate nodes removed
To see the log file from this workflow, save the Document Co-Citation Network (DCA) log file.
In the Sci2 Tool, select "361 unique ISI Records" from the 'FourNetSciResearchers' dataset in the Data Manager. Run 'Preprocessing > Topical > Lowercase, Tokenize, Stem, and Stopword Text' using the following parameters:
Text normalization utilizes the Standard Analyzer provided by Lucene (http://lucene.apache.org). It separates text into word tokens, normalizes word tokens to lower case, removes "s" from the end of words, removes dots from acronyms, deletes stop words, and applies the English Snowball stemmer (http://snowball.tartarus.org/algorithms/english/stemmer.html), which is a version of the Porter2 stemmer designed for the English language.
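A much-simplified sketch of the lowercase/tokenize/stopword steps follows (the `normalize` helper and its tiny stopword list are illustrative; the real pipeline uses Lucene's analyzer and the Snowball stemmer, which is omitted here):

```python
import re

# Tiny illustrative stopword list; Lucene ships a much longer one.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is"}

def normalize(text):
    """Lowercase the text, split it into word tokens, and drop
    stopwords. (Stemming is intentionally omitted in this sketch.)"""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("The Structure of Networks"))  # ['structure', 'networks']
```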
The result is a derived table – "with normalized Abstract" – in which the text in the abstract column is normalized. Select this table and run 'Data Preparation > Extract Word Co-Occurrence Network' using parameters:
If you are working with ISI data, you can use the aggregate function file indicated in the image below. Aggregate function files can be found in sci2/sampledata/scientometrics/properties.
If you are not working with ISI data and wish to create your own aggregate function file, you can find more information in 3.6 Property Files.
The outcome is a network in which nodes represent words and edges denote their joint appearance in a paper. Word co-occurrence networks are rather large and dense. Running 'Analysis > Networks > Network Analysis Toolkit (NAT)' reveals that the network has 2,821 word nodes and 242,385 co-occurrence edges.
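The extraction itself can be sketched in a few lines (the `word_cooccurrence` helper is illustrative, not Sci2's implementation): every pair of distinct words appearing in the same document contributes one count to the edge between them, which is why the edge count grows roughly quadratically with vocabulary per document.

```python
from itertools import combinations
from collections import Counter

def word_cooccurrence(documents):
    """Build an undirected word co-occurrence network: nodes are
    unique words, and the weight of edge (a, b) counts how many
    documents contain both a and b."""
    weights = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            weights[(a, b)] += 1
    return weights

docs = ["network science", "network analysis", "science of science"]
print(word_cooccurrence(docs))
```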
Once edges have been removed, the network "top 1000 edges by weight" can be visualized by running 'Visualization > Networks > GUESS'. In GUESS, run the following commands in the Interpreter:
> for node in g.nodes: node.x = node.xpos * 40 node.y = node.ypos * 40 > resizeLinear(references, 2, 40) > colorize(references,[200,200,200],[0,0,0]) > resizeLinear(weight, .1, 2) > g.edges.color = "127,193,65,255"
Note that only the top 1000 edges (by weight) in this large network appear in the above visualization, creating the impression of isolate nodes. To remove nodes that are not connected by the top 1000 edges (by weight), run 'Preprocessing > Networks > Delete Isolates' on the "top 1000 edges by weight" network and visualize the result using the workflow described above.
To see the log file from this workflow, save the Word Co-Occurrence Network log file.
This workflow uses the extended version of the Sci2 Tool. To learn how to extend Sci2, see Section 3.2 Additional Plugins.
The database plugin is not currently available for the most recent version of Sci2 (v1.0 alpha). However, the plugin that allows files to be loaded as databases is available for Sci2 v0.5.2 alpha or older. Please check the Sci2 news page (https://sci2.cns.iu.edu/user/news.php); we will update this page when a database plugin becomes available for the latest version of the tool.
The Sci2 Tool supports the creation of databases from ISI files. Database loading improves the speed and functionality of data preparation and preprocessing. While the initial loading can take quite some time for larger datasets (see sections 3.4 Memory Allocation and 3.5 Memory Limits) it results in vastly faster and more powerful data processing and extraction.
View the file "Burst detection analysis (Publication Year, Reference): maximum burst level 1". On a PC running Windows, right click on this table and select view to see the data in Excel. On a Mac or a Linux system, right click and save the file, then open using the spreadsheet program of your choice. See Burst Detection for the meaning of each field in the output.
An empty value in the "End" field indicates that the burst lasted until the last date present in the dataset. Where the "End" field is empty, manually add the last year present in the dataset; in this case, 2007.
After you manually add this information, save the .csv file somewhere on your computer. Load the .csv file back into Sci2 using 'File > Load' and select 'Standard csv format' in the pop-up window. A new table will appear in the Data Manager. To visualize this table, which contains the results of the Burst Detection algorithm, select it in the Data Manager and run 'Visualization > Temporal > Horizontal Bar Graph' with the following parameters:
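For large burst tables, the manual fill-in can also be scripted before reloading; a sketch (the `fill_empty_end` helper is hypothetical, and the "End" column name is assumed to match the burst-detection output):

```python
import csv
import io

def fill_empty_end(csv_text, last_year):
    """Replace empty 'End' values in burst-detection output with the
    last year present in the dataset, returning new CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if not row["End"].strip():
            row["End"] = str(last_year)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

text = "Word,Start,End\nnetwork,1999,\nscience,2001,2004\n"
print(fill_empty_end(text, 2007))
```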
The largest speed increases from the database functionality can be found in the extraction of networks. First, compare the results of a co-authorship extraction with those from the Author Co-Occurrence (Co-Author) Network section. Run 'Data Preparation > Database > ISI > Extract Co-Author Network' followed by 'Analysis > Networks > Network Analysis Toolkit (NAT)'. Notice that both networks have 247 nodes and 891 edges. Visualize the extracted co-author network in GUESS using 'Visualization > Networks > GUESS' and reformat the visualization using 'Layout > GEM' and 'Layout > Bin Pack.' To apply the default co-authorship theme, go to 'Script > Run Script' and find 'yoursci2directory/scripts/GUESS/co-author-nw_database.py'. The resulting network will look like Figure 5.21.
Figure 5.21: Longitudinal study of 'FourNetSciResearchers,' visualized in GUESS
Using Sci2's database functionality allows for several network extractions that cannot be achieved with the text-based algorithms. For example, extracting journal co-citation networks reveals which journals are cited together most frequently. Run 'Data Preparation > Database > ISI > Extract Document Co-Citation Network (Core and References)' on the database to create a network of co-cited journals, and then prune it using 'Preprocessing > Networks > Extract Edges Above or Below Value' with the parameters: