

...

To extract the paper citation network, select the '361 Unique ISI Records' table and run 'Data Preparation > Text Files > Extract Directed Network' using the parameters:

The result is a directed network of paper citations in the Data Manager. Each paper node has two citation counts. The local citation count (LCC) indicates how often a paper was cited by papers in the set. The global citation count (GCC) equals the times cited (TC) value in the original ISI file. Only references from other ISI records count towards an ISI paper's GCC value. Currently, the Sci2 Tool sets the GCC of references that are not themselves ISI records in the set to -1, making it easy to prune the network down to the original ISI records.
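The relationship between the two counts can be sketched in a few lines of Python. This is an illustrative sketch only, not Sci2's internals; the field and variable names are hypothetical:

```python
from collections import Counter

def citation_counts(papers, citations):
    """Compute LCC and GCC for a paper-citation network.

    papers    -- dict mapping paper id -> times-cited (TC) value from ISI
    citations -- list of (citing_id, cited_id) edges extracted from the set
    """
    # LCC: how often each paper is cited from within this set
    local = Counter(cited for _, cited in citations)
    counts = {}
    for pid, tc in papers.items():
        counts[pid] = {"LCC": local[pid],  # citations from within the set
                       "GCC": tc}          # times cited per ISI (TC field)
    # References that are not themselves ISI records in the set get GCC -1,
    # mirroring how Sci2 marks them so they can be pruned later.
    for _, cited in citations:
        if cited not in counts:
            counts[cited] = {"LCC": local[cited], "GCC": -1}
    return counts
```

Pruning the network back to the original records then amounts to deleting every node whose GCC is -1, as described below.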

...

  1. Resize Linear > Nodes > globalcitationcount > From: 1 To: 50 > When the nodes have no 'globalcitationcount': 0.1 > Do Resize Linear
  2. Colorize > Nodes > globalcitationcount > From:   To:   > When the nodes have no 'globalcitationcount': 0.1 > Do Colorize
  3. Colorize > Edges > weight > From: (select the "RGB" tab) 127, 193, 65 To: (select the "RGB" tab) 0, 0, 0
  4. Type in the Interpreter:
         > for n in g.nodes:
         [tab] n.strokecolor = n.color
    Or, select the 'Interpreter' tab at the bottom, left-hand corner of the GUESS window, and enter the command lines:
         > resizeLinear(globalcitationcount,1,50)
         > colorize(globalcitationcount,gray,black)
         > for e in g.edges:
         [tab] e.color="127,193,65,255"
    Note: The Interpreter tab will have '>>>' as a prompt for these commands. It is not necessary to type '>' at the beginning of the line. You should type each line individually and press "Enter" to submit the commands to the Interpreter.
    This will result in nodes which are linearly sized and color coded by their GCC, connected by green directed edges, as shown in Figure 5.11 (left). Any numeric node attribute within the network can be used to code the nodes. To view the available attributes, mouse over a node. The GUESS interface supports pan and zoom, node selection, and details on demand. For more information, refer to the GUESS tutorial at http://nwb.slis.indiana.edu/Docs/GettingStartedGUESSNWB.pdf.
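GUESS's resizeLinear simply maps an attribute's value range linearly onto a size range. A standalone sketch of that mapping, with the fallback value for missing attributes mirroring the 0.1 used above (this is an illustration of the idea, not GUESS's implementation):

```python
def resize_linear(values, lo, hi, missing=0.1):
    """Map attribute values linearly onto [lo, hi], in the spirit of
    GUESS's resizeLinear(attr, lo, hi).  Values of None (the attribute
    is absent) get the fallback size `missing`."""
    present = [v for v in values if v is not None]
    vmin, vmax = min(present), max(present)
    span = (vmax - vmin) or 1  # avoid divide-by-zero when all values are equal
    return [missing if v is None
            else lo + (v - vmin) * (hi - lo) / span
            for v in values]
```

The same mapping underlies colorize, except that the target range is a color gradient rather than a size interval.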

...

The complete network can be reduced to papers that appeared in the original ISI file by deleting all nodes that have a GCC of -1. Simply run 'Preprocessing > Networks > Extract Nodes Above or Below Value' with parameter values:

The resulting network is unconnected, i.e., it contains many subnetworks, many of which have only one node. These single unconnected nodes, also called isolates, can be removed using 'Preprocessing > Networks > Delete Isolates'. Deleting isolates is a memory-intensive procedure. If you experience problems at this step, refer to Section 3.3 Memory Allocation.
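Conceptually, isolate deletion just drops every node that no edge touches; a minimal sketch:

```python
def delete_isolates(nodes, edges):
    """Remove nodes that appear in no edge (isolates), as
    'Preprocessing > Networks > Delete Isolates' does conceptually.

    nodes -- list of node ids
    edges -- list of (source, target) pairs
    """
    connected = {n for edge in edges for n in edge}
    return [n for n in nodes if n in connected]
```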

The 'FourNetSciResearchers' dataset has exactly 65 isolates. Removing those leaves 12 networks shown in Figure 5.11 (right) using the same color and size coding as in Figure 5.11 (left). Using 'View > Information Window' in GUESS reveals detailed information for any node or edge.
Alternatively, nodes could have been color and/or size coded by their degree using, e.g.:
     > g.computeDegrees()
     > colorize(outdegree,gray,black)

...

The complete paper-paper-citation network can be split into its subnetworks using 'Analysis > Networks > Unweighted & Directed > Weak Component Clustering' with the default values:

...

To produce a co-authorship network in the Sci2 Tool, select the table of all 361 unique ISI records from the 'FourNetSciResearchers' dataset in the Data Manager window. Run 'Data Preparation > Text Files > Extract Co-Author Network' using the parameter:

The result is two derived files in the Data Manager window: the "Extracted Co-Authorship Network" and an "Author information" table (also known as a "merge table"), which lists unique authors. To manually examine and edit the list of unique authors, open the merge table in your default spreadsheet program. In the spreadsheet, select all records, including "label," "timesCited," "numberOfWorks," "uniqueIndex," and "combineValues," and sort by "label." Identify names that refer to the same person. To merge two names, first delete the asterisk ('*') in the "combineValues" column of the duplicate node's row. Then, copy the "uniqueIndex" of the name that should be kept and paste it into the cell of the name that should be deleted. Resave the revised table as a .csv file and reload it. Select both the merge table and the network and run 'Data Preparation > Text Files > Update Network by Merging Nodes'. Table 5.2 shows the result of merging "Albet, R" and "Albert, R": "Albet, R" will be deleted, and all of its node linkages and citation counts will be added to "Albert, R".
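The effect of the merge on citation counts can be sketched as below. The table layout is simplified to tuples, and the keep flag stands in for whether a row retained its '*' in "combineValues"; this is an illustration, not Sci2's merge code:

```python
from collections import defaultdict

def merge_nodes(rows, citations):
    """Apply an edited merge table: rows sharing a uniqueIndex are merged
    into the row that kept its '*' (keep=True), and timesCited values are
    summed, mirroring how the duplicate's counts are added to the kept node.

    rows      -- list of (label, unique_index, keep) tuples
    citations -- dict mapping label -> timesCited
    """
    groups = defaultdict(list)
    for label, idx, keep in rows:
        groups[idx].append((label, keep))
    merged = {}
    for idx, members in groups.items():
        kept = next(lbl for lbl, keep in members if keep)
        merged[kept] = sum(citations[lbl] for lbl, _ in members)
    return merged
```

With the Table 5.2 example, "Albet, R" disappears and its counts accrue to "Albert, R".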

...

A merge table can be automatically generated by applying the Jaro distance metric (Jaro, 1989, 1995) available in the open source Similarity Measure Library (http://sourceforge.net/projects/simmetrics/) to identify potential duplicates. In the Sci2 Tool, simply select the co-author network and run 'Data Preparation > Text Files > Detect Duplicate Nodes' using the parameters:

The result is a merge table that has the very same format as Table 5.2, together with two textual log files:

...

In sum, unification of author names can be done manually or automatically, independently or in conjunction with other data manipulation. It is recommended that users create the initial merge table automatically and fine-tune it as needed. Note that the same procedure can be used to identify duplicate references – simply select a paper-citation network and run 'Data Preparation > Text Files > Detect Duplicate Nodes' using the same parameters as above and a merge table for references will be created.
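For reference, the Jaro measure itself is easy to state: it averages the fraction of matched characters in each string with the fraction of matches that are not transposed. A sketch of the metric (not the SimMetrics implementation):

```python
def jaro(s1, s2):
    """Jaro similarity (Jaro, 1989): 1.0 for identical strings, 0.0 when
    no characters match.  Characters match if they are equal and no
    farther apart than half the longer string's length, minus one."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    m1 = []  # matched characters of s1, in order
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                m1.append(c)
                break
    m2 = [c for j, c in enumerate(s2) if used[j]]  # matches of s2, in order
    m = len(m1)
    if m == 0:
        return 0.0
    t = sum(a != b for a, b in zip(m1, m2)) / 2  # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3
```

Name pairs such as "Albert, R" / "Albet, R" score close to 1.0, which is why they surface as candidate duplicates.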

To merge identified duplicate nodes, select both the "Extracted Co-Authorship Network" and "Merge Table: based on label" by holding down the 'Ctrl' key. Run 'Data Preparation > Text Files > Update Network by Merging Nodes'. This will produce an updated network as well as a report describing which nodes were merged. To complete this workflow, an aggregation function file must also be selected from the pop-up window:

...

Running 'Analysis > Networks > Network Analysis Toolkit (NAT)' reveals that the network has 5,342 nodes (5,013 of which are isolate nodes) and 6,277 edges.

...


Isolate nodes can be removed by running 'Preprocessing > Networks > Delete Isolates'. The resulting network has 242 nodes and 1,534 edges in 12 weakly connected components.

...

Select the "361 Unique ISI Records" and run 'Data Preparation > Text Files > Extract Document Co-Citation Network.' The co-citation network will have 5,335 nodes (213 of which are isolates) and 193,039 edges. Isolates can be removed by running 'Preprocessing > Networks > Delete Isolates.' The resulting network has 5,122 nodes and 193,039 edges, which is too dense for display in GUESS. Edges with low weights can be eliminated by running 'Preprocessing > Networks > Extract Edges Above or Below Value' with parameter values:
     Extract from this number: 4
     Below?: # leave unchecked
     Numeric Attribute: weight
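Co-citation extraction and the edge-weight cutoff can be sketched together: two references are linked whenever some paper cites both, the edge weight counts such co-occurrences, and weak links are then dropped. Whether the cutoff is inclusive is an assumption in this sketch:

```python
from itertools import combinations
from collections import Counter

def cocitation_edges(references_by_paper, min_weight=4):
    """Build a document co-citation network and drop weak edges.

    references_by_paper -- dict mapping paper id -> list of cited reference ids
    min_weight          -- keep an edge only if at least this many papers
                           co-cite the pair (inclusive threshold assumed)
    """
    weights = Counter()
    for refs in references_by_paper.values():
        # every unordered pair of references cited by the same paper
        for a, b in combinations(sorted(set(refs)), 2):
            weights[(a, b)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_weight}
```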

...

In the Sci2 Tool, select "361 unique ISI Records" from the 'FourNetSciResearchers' dataset in the Data Manager. Run 'Preprocessing > Topical > Normalize Lowercase, Tokenize, Stem, and Stopword Text' using the following parameters:
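The shape of this normalization pipeline can be sketched as below. Note that the tiny stopword list and the crude suffix-stripping stemmer are stand-ins for Sci2's actual components, which use a proper stopword file and stemmer:

```python
import re

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for"}  # tiny sample list

def normalize(text):
    """Lowercase, tokenize, remove stopwords, and crudely stem an abstract.
    The suffix stripping here only illustrates the pipeline's shape; it is
    not the stemming algorithm Sci2 ships with."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        for suffix in ("ing", "es", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed
```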

...

The result is a derived table – "with normalized Abstract" – in which the text in the abstract column is normalized. Select this table and run 'Data Preparation > Text Files > Extract Word Co-Occurrence Network' using the parameters:

The outcome is a network in which nodes represent words and edges denote their joint appearance in a paper. Word co-occurrence networks are rather large and dense. Running 'Analysis > Networks > Network Analysis Toolkit (NAT)' reveals that the network has 2,821 word nodes and 242,385 co-occurrence edges.
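Building such a network amounts to counting pairwise co-appearances per abstract; a minimal sketch over already-normalized abstracts:

```python
from itertools import combinations
from collections import Counter

def word_cooccurrence(abstracts):
    """Link two words whenever they appear in the same (normalized)
    abstract; the edge weight counts co-appearances across papers.

    abstracts -- list of word lists, one per paper
    Returns (node set, Counter of (word, word) -> weight).
    """
    edges = Counter()
    nodes = set()
    for words in abstracts:
        uniq = sorted(set(words))
        nodes.update(uniq)
        for a, b in combinations(uniq, 2):
            edges[(a, b)] += 1
    return nodes, edges
```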

...

The result is one giant component with 2,467 nodes and 242,385 edges. To visualize this rather large network, begin by running 'Visualization > Networks > DrL (VxOrd)' with default values:


Note that the DrL algorithm is computationally intensive and can take some time to run, even on powerful systems. See the console window for details:

...

To keep only the strongest edges in the "Laid out with DrL" network, run 'Preprocessing > Networks > Extract Top Edges' on the new network using the following parameters:
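Conceptually, 'Extract Top Edges' ranks edges by weight and keeps the strongest n; a sketch of that idea (not Sci2's implementation):

```python
def top_edges(edges, n=1000):
    """Keep only the n strongest edges by weight.

    edges -- dict mapping (source, target) -> weight
    """
    ranked = sorted(edges.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])
```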

...

Once edges have been removed, the network "top 1000 edges by weight" can be visualized by running 'Visualization > Networks > GUESS'. In GUESS, run the following commands in the Interpreter:
     > for node in g.nodes:
     [tab] node.x = node.xpos * 40
     [tab] node.y = node.ypos * 40
     [tab]
     > resizeLinear(references, 2, 40)
     > colorize(references,[200,200,200],[0,0,0])
     > resizeLinear(weight, .1, 2)
     > g.edges.color = "127,193,65,255"

...

Note that only the top 1000 edges (by weight) in this large network appear in the above visualization, creating the impression of isolate nodes. To remove nodes that are not connected by the top 1000 edges (by weight), run 'Preprocessing > Networks > Delete Isolates' on the "top 1000 edges by weight" network and visualize the result using the workflow described above.

...

The Sci2 Tool supports the creation of databases from ISI files. Database loading improves the speed and functionality of data preparation and preprocessing. While the initial loading can take quite some time for larger datasets (see sections 3.3 Memory Allocation and 3.4 Memory Limits), it results in vastly faster and more powerful data processing and extraction.

Once again load 'yoursci2directory/sampledata/scientometrics/isi/FourNetSciResearchers.isi' using 'File > Load', this time choosing 'ISI database' from the load window. Right-click to view the database schema:


Figure 5.17: The database schema as viewed in Notepad

As before, it is important to clean the database before running any extractions by merging and matching authors, journals, and references. Run 'Data Preparation > Database > ISI > Merge Identical ISI People', followed by 'Data Preparation > Database > ISI > Merge Document Sources' and 'Data Preparation > Database > ISI > Match References to Papers'. Make sure to wait until each cleaning step is complete before beginning the next one.


Figure 5.18: Cleaned database of 'FourNetSciResearchers'

Extracting different tables will provide different views of the data. Run 'Data Preparation > Database > ISI > Extract Authors' to view all the authors from FourNetSciResearchers.isi. The table includes the number of papers each person in the dataset authored, their Global Citation Count (how many times they have been cited according to ISI), and their Local Citation Count (how many times they were cited in the current dataset).
The queries can also output data specifically tailored for the burst detection algorithm (see section 4.6.1 Burst Detection).
Run 'Data Preparation > Database > ISI > Extract References by Year for Burst Detection' on the cleaned "with references and papers matched" database, followed by 'Analysis > Topical > Burst Detection' with the following parameters:

Visualize the burst analysis with 'Visualization > Temporal > Horizontal Bar Graph' with the following parameters:

See section 2.4 Saving Visualizations for Publication to save and view the graph.

 
Figure 5.19: Top reference bursts in the 'FourNetSciResearchers' dataset

For temporal studies, it can be useful to aggregate data by year rather than by author, reference, etc. Running 'Data Preparation > Database > ISI > Extract Longitudinal Summary' will output a table which lists metrics for every year mentioned in the dataset. The longitudinal study table contains the volume of documents and references published per year, as well as the total number of references, the number of distinct references, distinct authors, distinct sources, and distinct keywords per year. The results are graphed in Figure 5.20 (the graph was created using Excel). 
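The per-year aggregation can be sketched as below; the record fields are hypothetical stand-ins for the ISI columns, and only three of the listed metrics are shown:

```python
from collections import defaultdict

def longitudinal_summary(records):
    """Aggregate per publication year, in the spirit of 'Extract
    Longitudinal Summary': documents published, total references cited,
    and distinct authors per year.

    records -- list of dicts with hypothetical keys
               "year", "references", "authors"
    """
    acc = defaultdict(lambda: {"documents": 0, "references": 0,
                               "authors": set()})
    for rec in records:
        row = acc[rec["year"]]
        row["documents"] += 1
        row["references"] += len(rec["references"])
        row["authors"].update(rec["authors"])
    # distinct counts are taken per year, after accumulation
    return {year: {"documents": row["documents"],
                   "references": row["references"],
                   "distinct_authors": len(row["authors"])}
            for year, row in acc.items()}
```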

...

The largest speed increases from the database functionality can be found in the extraction of networks. First, compare the results of a co-authorship extraction with those from section 5.1.4.2 Author Co-Occurrence (Co-Author) Network. Run 'Data Preparation > Database > ISI > Extract Co-Author Network' followed by 'Analysis > Networks > Network Analysis Toolkit (NAT)'. Notice that both networks have 247 nodes and 891 edges. Visualize the extracted co-author network in GUESS using 'Visualization > Networks > GUESS' and reformat the visualization using 'Layout > GEM' and 'Layout > Bin Pack.' To apply the default co-authorship theme, go to 'Script > Run Script' and find 'yoursci2directory/scripts/GUESS/co-author-nw_database.py'. The resulting network will look like Figure 5.21.

...

Using Sci2's database functionality allows for several network extractions that cannot be achieved with the text-based algorithms. For example, extracting journal co-citation networks reveals which journals are cited together most frequently. Run 'Data Preparation > Database > ISI > Extract Journal Co-Citation Network (Core and References)' on the database to create a network of co-cited journals, and then prune it using 'Preprocessing > Networks > Extract Edges Above or Below Value' with the parameters:


Now remove isolates ('Preprocessing > Networks > Delete Isolates') and append node degree attributes to the network ('Analysis > Networks > Unweighted & Undirected > Node Degree'). The workflow in your Data Manager should look like this:
View the network in GUESS using 'Visualization > Networks > GUESS.' Use 'Layout > GEM' and 'Layout > Bin Pack' to reformat the visualization. Resize and color the edges to display the strongest and earliest co-citation links using the following parameters:

...