
...

The second part of Katy Börner's research profile will focus on her Co-PIs. The data can be downloaded for free using NSF's Award Search (see section 4.2.2.1 NSF Award Search) by searching for "Katy Borner" in the "Principal Investigator" field and keeping the "Include CO-PI" box checked.

Load the NSF data 'yoursci2directory/sampledata/scientometrics/nsf/KatyBorner.nsf' using 'File > Load'. Select NSF csv format from the 'Load' pop-up window. Make sure the loaded dataset in the Data Manager window is highlighted in blue, and run 'Data Preparation > Text Files > Extract Co-Occurrence Network' using these parameters:

...

Select the "Extracted Network on Column All Investigators" network and run 'Analysis > Networks > Network Analysis Toolkit (NAT)' to reveal that the network has 13 nodes and 28 edges, with no isolates. Click on "Extracted Network on Column All Investigators" and select 'Visualization > Networks > GUESS' to visualize the resulting Co-PI network. Select 'GEM' from the layout menu.

Load the default Co-PI visualization theme via 'Script > Run Script ...' and load 'yoursci2directory/scripts/GUESS/co-PI-nw.py'. Alternatively, use the "Graph Modifier" to customize the visualization. The resulting network in Figure 5.2 was modified using the following workflow:

...

"Slice Into" allows the user to slice the table by days, weeks, months, quarters, years, decades, and centuries. There are two additional parameters for time slicing: cumulative and align with calendar. The former produces tables containing all data from the beginning to the end of each table's time interval, which can be seen in the Data Manager and below:

The latter option aligns the output tables according to calendar intervals:

Choosing "Years" under "Slice Into" creates multiple tables, with the first slice beginning on January 1st of the earliest year. If "Months" is chosen, slicing starts from the first day of the earliest month in the chosen time interval.
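The cumulative option can be illustrated with a small, self-contained sketch; the record format, field names, and dates below are made up for illustration and are not Sci2's implementation:

```python
from datetime import date

# Toy records standing in for a loaded publication table.
records = [
    {"title": "A", "date": date(1990, 6, 1)},
    {"title": "B", "date": date(1991, 3, 15)},
    {"title": "C", "date": date(1992, 11, 2)},
]

def slice_by_year(records, cumulative=True):
    """Return {year: [records]} slices. Cumulative slices contain all
    records from the earliest year through the slice's year; otherwise
    each slice holds only that year's records."""
    years = sorted({r["date"].year for r in records})
    slices = {}
    for y in years:
        if cumulative:
            slices[y] = [r for r in records if r["date"].year <= y]
        else:
            slices[y] = [r for r in records if r["date"].year == y]
    return slices

cum = slice_by_year(records, cumulative=True)
# The 1992 cumulative slice contains all three records; the
# non-cumulative 1992 slice contains only "C".
```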

To see the evolution of Vespignani's co-authorship network over time, check "cumulative". Then, extract co-authorship networks one at a time for each sliced time table using 'Data Preparation > Text Files > Extract Co-Author Network', making sure to select "ISI" from the pop-up window during the extraction. Visualize the evolving network using GUESS as shown in Figure 5.4.

Figure 5.4: Evolving co-authorship network of Vespignani from 1990-2006, shown as cumulative slices for 1990-1991, 1990-1996, 1990-2001, and 1990-2006

...

It is often useful to compare the profiles of multiple researchers within similar disciplinary or institutional domains. To demonstrate this comparison, load the NSF funding profiles of three Indiana University researchers into the Sci2 Tool using 'File > Load' from 'yoursci2directory/sampledata/scientometrics/nsf'. Once 'GeoffreyFox.nsf', 'MichaelMcRobbie.nsf', and 'BethPlale.nsf' are loaded in NSF csv format, run 'Visualization > Temporal > Horizontal Bar Graph', using the recommended parameters for each.

For instructions on how to save and view the PostScript file generated by the "Horizontal Bar Graph" algorithm, see section 2.4 Saving Visualizations for Publication.

Select 'GeoffreyFox.nsf' in the Data Manager. Use the following parameters to generate a Horizontal Bar Graph:

...

Run 'Visualization > Networks > GUESS' on each generated network to visualize the resulting Co-PI relationships. Select 'GEM' from the layout menu to organize the nodes and edges.

To color and size the nodes and edges using the default Co-PI visualization theme, run 'yoursci2directory/scripts/GUESS/co-PI-nw.py' from 'Script > Run Script ...'.

...

Load the file 'yoursci2directory/sampledata/scientometrics/isi/FourNetSciResearchers.isi' using 'File > Load and Clean ISI File.' Choose "ISI scholarly format" in the pop-up 'Load' window. A table of all records and a table of 361 records with unique ISI ids will appear in the Data Manager. In this "clean" file, each original record now has a "Cite Me As" attribute that is constructed from the first author, publication year (PY), journal abbreviation (J9), volume (VL), and beginning page (BP) fields of its ISI record. This "Cite Me As" attribute will be used when matching paper and reference records.
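The construction of such a citation string can be sketched as follows; the record layout and the `cite_me_as` helper below are illustrative, not Sci2's actual code:

```python
# Assemble a "Cite Me As" string from ISI fields: first author (AU),
# publication year (PY), journal abbreviation (J9), volume (VL), and
# beginning page (BP). The exact output format here is an assumption.
def cite_me_as(record):
    first_author = record["AU"].split(";")[0].strip()
    return "%s, %s, %s, V%s, P%s" % (
        first_author, record["PY"], record["J9"], record["VL"], record["BP"])

# Made-up record in the spirit of an ISI entry.
record = {
    "AU": "Barabasi, AL; Albert, R",
    "PY": "1999",
    "J9": "SCIENCE",
    "VL": "286",
    "BP": "509",
}
print(cite_me_as(record))  # Barabasi, AL, 1999, SCIENCE, V286, P509
```

Because references in ISI records are stored in a similar compressed form, a string built this way can be matched against reference strings to link papers to the works they cite.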

To extract the paper citation network, select the '361 Unique ISI Records' table and run 'Data Preparation > Text Files > Extract Directed Network' using the following parameters:

...

The result is a directed network of paper citations in the Data Manager. Each paper node has two citation counts. The local citation count (LCC) indicates how often a paper was cited by papers in the set. The global citation count (GCC) equals the times cited (TC) value in the original ISI file. Only references from other ISI records count towards an ISI paper's GCC value. Currently, the Sci2 Tool sets the GCC of references to -1 (except for references that are not also ISI records) to prune the network to contain only the original ISI records.
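The LCC computation can be illustrated with a toy edge list (citing -> cited); the paper IDs below are made up, and the GCC would instead be read from each record's TC field:

```python
from collections import Counter

# Directed citation edges within the set: (citing paper, cited paper).
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p3")]

# LCC: how often each paper is cited by other papers *in the set*.
lcc = Counter(cited for _, cited in edges)
# p3 is cited by three papers in the set, p2 by one, p1 and p4 by none.
```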

To view the complete network, select the "Network with directed edges from Cited References to Cite Me As" in the Data Manager and run 'Visualization > Networks > GUESS' and wait until the network is visible and centered. Because the FourNetSciResearchers dataset is so large, the visualization will take some time to load, even on powerful systems.

...

Note that the outdegree corresponds to the LCC within the given network while the indegree reflects the number of references, helping to visually identify review papers.

The complete paper-paper-citation network can be split into its subnetworks using 'Analysis > Networks > Unweighted & Directed > Weak Component Clustering' with the default values:

...
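Conceptually, weak component clustering treats every directed edge as undirected and groups nodes into connected components. A stdlib-only sketch on a toy edge list (not the sample dataset):

```python
from collections import defaultdict

# Toy directed edges; direction is ignored for weak components.
edges = [("a", "b"), ("b", "c"), ("d", "e")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def weak_components(adj):
    """Depth-first traversal over the undirected adjacency map,
    collecting each connected component once."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components

comps = weak_components(adj)
# Two weak components: {a, b, c} and {d, e}.
```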

The log files describe, in a more human-readable form, which nodes will and will not be merged: the first log file lists the nodes that will be merged, while the second lists nodes that are similar but will not be merged. The automatically generated merge table can be further modified as needed.

In sum, unification of author names can be done manually or automatically, independently or in conjunction with other data manipulation. It is recommended that users create the initial merge table automatically and fine-tune it as needed. Note that the same procedure can be used to identify duplicate references – simply select a paper-citation network and run 'Data Preparation > Text Files > Detect Duplicate Nodes' using the same parameters as above and a merge table for references will be created.
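The idea behind an automatically generated merge table can be sketched with a simple string-similarity pass; Sci2 uses its own matching algorithm, and the author names and the 0.7 cutoff below are illustrative:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Made-up author labels, including two spelling variants of one name.
labels = ["Vespignani, A", "Vespignani, Alessandro", "Barabasi, AL"]

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Propose a merge for every pair of labels above the (assumed) cutoff.
merge_table = [
    (a, b, round(similarity(a, b), 2))
    for a, b in combinations(labels, 2)
    if similarity(a, b) > 0.7
]
# Proposes merging the two "Vespignani" variants; "Barabasi, AL" stays.
```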

To merge identified duplicate nodes, select both the "Extracted Co-Authorship Network" and "Merge Table: based on label" by holding down the 'Ctrl' key. Run 'Data Preparation > Text Files > Update Network by Merging Nodes'. This will produce an updated network as well as a report describing which nodes were merged. To complete this workflow, an aggregation function file must also be selected from the pop-up window:

...

Alternatively, run 'Script > Run Script ...' and select 'yoursci2directory/scripts/GUESS/co-author-nw.py'.

For both workflows described above, the final step should be to run 'Layout > GEM' and then 'Layout > Bin Pack' to give a better representation of node clustering.

In the resulting visualization, author nodes are color and size coded by the number of papers per author. Edges are color and thickness coded by the number of times two authors wrote a paper together. The remaining commands identify the top 50 authors with the most papers and make their name labels visible.

...

Load the file 'yoursci2directory/sampledata/scientometrics/isi/FourNetSciResearchers.isi' using 'File > Load and Clean ISI File.' Choose "ISI scholarly format" in the pop-up 'Load' window. A table of all records and a table of 361 records with unique ISI ids will appear in the Data Manager.

Select the "361 Unique ISI Records" in the Data Manager and run 'Data Preparation > Text Files > Paper Citation Network.' Select "Extracted Paper Citation Network" and run 'Data Preparation > Text Files > Extract Reference Co-Occurrence (Bibliographic Coupling) Network.'
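Bibliographic coupling links two papers by the number of references they share. A minimal sketch on made-up reference lists (not the sample data):

```python
from itertools import combinations

# Toy papers mapped to the sets of references they cite.
refs = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3", "r4"},
    "p3": {"r5"},
}

# Coupling weight = size of the shared-reference set; pairs with no
# shared references get no edge.
coupling = {
    (a, b): len(refs[a] & refs[b])
    for a, b in combinations(sorted(refs), 2)
    if refs[a] & refs[b]
}
# p1 and p2 share two references (r2, r3); p3 is coupled to neither.
```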

...

Running 'Analysis > Networks > Network Analysis Toolkit (NAT)' reveals that the network has 5,342 nodes (5,013 of which are isolate nodes) and 6,277 edges.

In the "Bibliographic Coupling Similarity Network," edges with low weights can be eliminated by running 'Preprocessing > Networks > Extract Edges Above or Below Value' with the following parameter values:

...

Alternatively, run 'GUESS: File > Run Script ...' and select 'yoursci2directory/scripts/GUESS/reference-co-occurence-nw.py'.

For both workflows described above, the final step should be to run 'Layout > GEM' and then 'Layout > Bin Pack' to give a better representation of node clustering.

Figure 5.14: Reference co-occurrence network layout for 'FourNetSciResearchers' dataset

...

Load the file 'yoursci2directory/sampledata/scientometrics/isi/FourNetSciResearchers.isi' using 'File > Load and Clean ISI File.' Choose "ISI scholarly format" in the pop-up 'Load' window. A table of all records and a table of 361 records with unique ISI ids will appear in the Data Manager.

Select the "361 Unique ISI Records" and run 'Data Preparation > Text Files > Extract Document Co-Citation Network.' The co-citation network will have 5,335 nodes (213 of which are isolates) and 193,039 edges. Isolates can be removed by running 'Preprocessing > Networks > Delete Isolates.' The resulting network has 5,122 nodes and 193,039 edges – and is too dense for display in GUESS. Edges with low weights can be eliminated by running 'Preprocessing > Networks > Extract Edges Above or Below Value' with parameter values:
     Extract from this number: 4
     Below?: # leave unchecked
     Numeric Attribute: weight

Here, only edges with a local co-citation count of five or higher are kept. The giant component in the resulting network has 265 nodes and 1,607 edges. All other components have only one or two nodes.
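Document co-citation counting, and the subsequent edge-weight cutoff, can be sketched on toy data; the citing papers, references, and the cutoff of 2 below are illustrative, not the workflow's actual values:

```python
from collections import Counter
from itertools import combinations

# Toy citing papers mapped to the references each one cites.
cited_by_paper = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r1", "r2"},
    "p3": {"r1", "r2"},
}

# Two references are co-cited once for every paper that cites them both.
cocitation = Counter()
for refs in cited_by_paper.values():
    for a, b in combinations(sorted(refs), 2):
        cocitation[(a, b)] += 1

# Drop weak edges, analogous to 'Extract Edges Above or Below Value'
# with "Extract from this number: 2" and "Below?" unchecked.
strong = {pair: w for pair, w in cocitation.items() if w > 2}
# (r1, r2) is co-cited by all three papers; the other pairs only once.
```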

The giant component can be visualized in GUESS, see Figure 5.15 (right). Follow the explanation above, using the same size and color coding and labeling as for the bibliographic coupling network: run 'GUESS: File > Run Script ...' and select 'yoursci2directory/scripts/GUESS/reference-co-occurence-nw.py'.

...

Text normalization utilizes the Standard Analyzer provided by Lucene (http://lucene.apache.org). It separates text into word tokens, normalizes word tokens to lower case, removes "s" from the end of words, removes dots from acronyms, deletes stop words, and applies the English Snowball stemmer (http://snowball.tartarus.org/algorithms/english/stemmer.html), a version of the Porter2 stemmer designed for the English language.
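The described steps can be approximated with a stdlib-only sketch; the stop-word list and the trailing-"s" rule below are crude stand-ins for Lucene's Standard Analyzer and the Snowball stemmer, not the actual pipeline:

```python
import re

# Tiny illustrative stop-word list (the real analyzer uses a larger one).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "on", "is", "are"}

def normalize(text):
    """Tokenize, lowercase, drop stop words, and apply a naive
    trailing-"s" rule as a stand-in for real stemming."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(normalize("The Networks of Scientific Collaborations"))
# ['network', 'scientific', 'collaboration']
```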

The result is a derived table – "with normalized Abstract" – in which the text in the abstract column is normalized. Select this table and run 'Data Preparation > Text Files > Extract Word Co-Occurrence Network' using parameters:

...

The outcome is a network in which nodes represent words and edges denote their joint appearance in a paper. Word co-occurrence networks are rather large and dense. Running 'Analysis > Networks > Network Analysis Toolkit (NAT)' reveals that the network has 2,821 word nodes and 242,385 co-occurrence edges.
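Extracting such a network can be sketched as counting word pairs per abstract; the normalized abstracts below are made up for illustration:

```python
from collections import Counter
from itertools import combinations

# Toy abstracts, already normalized into word lists.
abstracts = [
    ["network", "evolution", "model"],
    ["network", "model"],
]

# Each pair of distinct words in the same abstract adds 1 to the
# weight of the edge between them.
edges = Counter()
for words in abstracts:
    for a, b in combinations(sorted(set(words)), 2):
        edges[(a, b)] += 1

# ("model", "network") co-occurs in both abstracts, so its weight is 2.
```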

There are 354 isolated nodes that can be removed by running 'Preprocessing > Networks > Delete Isolates' on the Co-Word Occurrence network. Note that when isolates are removed, papers without abstracts are removed along with the keywords.

The result is one giant component with 2,467 nodes and 242,385 edges. To visualize this rather large network, begin by running 'Visualization > Networks > DrL (VxOrd)' with default values:

...

Once edges have been removed, the network "top 1000 edges by weight" can be visualized by running 'Visualization > Networks > GUESS'. In GUESS, run the following commands in the Interpreter:
     > for node in g.nodes:
     ...     node.x = node.xpos * 40
     ...     node.y = node.ypos * 40
     > resizeLinear(references, 2, 40)
     > colorize(references, [200, 200, 200], [0, 0, 0])
     > resizeLinear(weight, .1, 2)
     > g.edges.color = "127,193,65,255"

The result should look something like Figure 5.16.

Figure 5.16: Undirected, weighted word co-occurrence network visualization for the DrL-processed 'FourNetSciResearchers' dataset

Currently, when you resize large networks in the GUESS visualization tool, the network visualizations can become "uncentered" in the display window. Running 'View > Center' does not solve this problem. Users should zoom out to find the visualization, center it, and then zoom back in.

Note that only the top 1000 edges (by weight) in this large network appear in the above visualization, creating the impression of isolate nodes. To remove nodes that are not connected by the top 1000 edges (by weight), run 'Preprocessing > Networks > Delete Isolates' on the "top 1000 edges by weight" network and visualize the result using the workflow described above.

...