RCT Citation Network - Kevin Makice


Reveal the impact

This poster design is the most recent iteration of a visual description of citation network data for RCT scholarly publications.

2022

Relational Cultural Theory (RCT) is a body of work that arose out of feminist and multicultural critique of psychology in the 1970s and 1980s. At the start of this area of study, psychologist Jean Baker Miller wrote a foundational book in 1976, Toward a New Psychology of Women, which examined the inadequacies of traditional psychology in the context of women’s experiences. Together with several other core authors, Miller developed a body of collaborative work that persists and thrives today.

This visualization project prepared a poster that provides a comprehensive summary of published RCT knowledge, using a citation network of works stemming from Miller’s foundational book. The three key questions being answered were:

What body of work has Toward a New Psychology of Women inspired?
Which works are most critical to the RCT community of scholars?
Which topics are most dominant in RCT?

Network data was compiled through extensive data mining of works using the OpenAlex project, a catalog of global research, resulting in a relevant network of 1,289,418 works and 3,182,441 references and citations. This includes 98 anchoring works authored by seven founding and continuing scholars understood to be at the center of RCT—Jean Baker Miller, Irene P. Stiver, Janet Surrey, Judith Jordan, Maureen Walker, Amy Banks, and Harriet Schwartz.

Key aspects of the data analysis that followed were pruning the data to a usable size and classifying communities, work that fueled the visual design of the poster.

Pruning the Network
Identifying Community
Exploration
Final Visualization

Reduce the network to a manageable size that remains relevant to RCT.

Pruning the Network

In an effort to keep the network growth directed during data mining while still allowing for surprises deep in the network, two requirements were enforced. First, works in the periphery of the network were explored further only if they cited a sufficient number of anchor works published by the seven core authors. Second, the more distant a work was in the network from Miller’s foundational book, the more connected it had to be to the other anchor works.

This mitigated the effects of casual references that were really unrelated to RCT, but much of the girth of the citation network that remained was due to works that are peripheral to Toward a New Psychology of Women and the central body of RCT work stemming from it. The first task was to reduce its massive size in a meaningful way.

Core Works Valuation

The lowest-hanging fruit were the works on the fringe of the core network. These works were identified by examining the core value metric, which counts the number of connections to contributing works (i.e., works that added their own citations and references during data mining). Core value benefited contributing works most, but this measure also allowed works that were only added as references to score well by virtue of how much they were cited by contributing work. The minimum connectivity needed was 7, reflecting the furthest possible path from the foundational work.
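As a rough illustration of the core value metric, the sketch below counts how many of each work's neighbors are contributing works and keeps only the works that meet a minimum. It assumes the network is held in a NetworkX graph and that contributing works carry a boolean node attribute named contributing. Both the attribute name and the helper functions are hypothetical and not taken from the project code.

```python
import networkx as nx


def core_values(graph, flag="contributing"):
    """Count each node's distinct neighbors that are contributing works."""
    # Citations and references both count, so edge direction is ignored.
    undirected = graph.to_undirected(as_view=True)
    return {
        node: sum(1 for nbr in undirected[node] if graph.nodes[nbr].get(flag, False))
        for node in graph
    }


def prune_by_core_value(graph, minimum=7, flag="contributing"):
    """Return a copy of the graph keeping only nodes at or above the minimum."""
    scores = core_values(graph, flag)
    keep = [node for node, score in scores.items() if score >= minimum]
    return graph.subgraph(keep).copy()


# Toy data (invented): "hub" is cited by three contributing works, "leaf" by one.
toy = nx.DiGraph()
toy.add_nodes_from(["a", "b", "c"], contributing=True)
toy.add_nodes_from(["hub", "leaf"], contributing=False)
toy.add_edges_from([("a", "hub"), ("b", "hub"), ("c", "hub"), ("a", "leaf")])

toy_scores = core_values(toy)
toy_pruned = prune_by_core_value(toy, minimum=2)
```

The toy threshold of 2 only keeps the example small. The project itself used a minimum of 7.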

This first pruning action represented a reduction of 93.43 percent of the works in the network and 67.03 percent of all edges. This also reduced the number of anchor works by core RCT authors from 98 to 53 works.

Core Works Valuation

This graph shows the distribution of nodes based on their core value scores that measure how much they are connected to contributing works in the initial RCT citation network. The red line is placed at 5,000 nodes.

Ego Network Appearances

A method to further reduce the network centered on the overlapping subgraphs formed by the ego networks (i.e., nodes connected directly to a single starting node) of the remaining anchor works. For each ego network, all edges were ranked by betweenness centrality, following the Girvan-Newman approach. The top-ranked edge was removed, and the network was checked to see whether the number of separate components had changed.

The standard Girvan-Newman method recalculates the centrality measures after every removal, a very expensive investment in processing for a network this size. That cost was mitigated by calculating centrality only once and relying on the initial rank ordering of edges to suggest which ones to remove, a shortcut referred to here as the Girvan-Newman Approximation. My prior work examining this approximation demonstrated that the significant gains in processing time did not detract from the quality of the results.
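The single-ranking shortcut can be sketched as follows. This is a minimal illustration on an undirected NetworkX graph, not the project's actual code, and the function name split_once_approx is made up for the example.

```python
import networkx as nx


def split_once_approx(graph):
    """Remove edges in a fixed betweenness order until the graph splits.

    Standard Girvan-Newman recomputes edge betweenness after every removal.
    This approximation ranks the edges a single time and reuses that order.
    """
    work = graph.copy()
    start = nx.number_connected_components(work)
    ranking = nx.edge_betweenness_centrality(work)  # computed only once
    removed = []
    for u, v in sorted(ranking, key=ranking.get, reverse=True):
        work.remove_edge(u, v)
        removed.append((u, v))
        if nx.number_connected_components(work) > start:
            break  # the number of components has changed
    return work, removed


# Toy data (invented): two triangles joined by a single bridge edge (2, 3).
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)])
split, removed_edges = split_once_approx(toy)
```

In the toy graph the bridge carries the most shortest paths, so it is the first edge removed and the graph separates into its two triangles.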

In this way, each ego network was deconstructed until all that remained was the largest component that still contained all of the relevant anchor works from the initial ego network. Those ego networks were then layered on top of each other to form a heat map that scored each node by the number of connections to the core works. Nodes that did not reach a minimum threshold of 2 were discarded.

This second pruning action resulted in a citation network of 79,215 nodes and 949,242 edges, a reduction of 6.53 percent of the works in the network and 9.51 percent of all edges.

Ego Network Appearances

This graph shows the distribution of nodes based on the number of times a work appeared in an ego network of an anchor work by a core RCT author. Any works with fewer than two appearances were removed from the network. The red line is placed at 5,000 nodes.

In the final analysis, only 406 of the total 2,224 citations to Miller’s book were of consequence to RCT.

Make sense of the nature of citation clusters by examining the network structure.

Identifying Community

Community analysis was conducted on the pruned citation network using a few candidate methods—Greedy Modularity, Asynchronous Fluid Communities, and Kernighan-Lin Bisection—that were selected based on their relative success in earlier tests out of a larger group of classification strategies that included Central Connection Value, Link Community, and K-Clique.

The RCT citation network is separated into 8 communities. This graph depicts where the works by core authors are located.

Greedy Modularity

The greedy modularity approach builds communities from the bottom up, starting with every node in its own community. With each step, this agglomeration joins a pair of communities with the aim of maximizing modularity, a measure of the relative density of links within communities. Once no further increases are possible, the process ends. In this network, greedy modularity produced 8 communities, with four of them containing at least one core work. The best candidate for RCT contained 46 core author works and was the largest of the eight communities.
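For readers who want to try this classification, NetworkX ships an implementation as greedy_modularity_communities. The sketch below runs it on a toy graph and tallies how many anchor works land in each community. The helper anchors_per_community and the toy anchor list are invented for illustration.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def anchors_per_community(communities, anchors):
    """Count how many anchor works fall inside each community."""
    anchors = set(anchors)
    return [len(anchors & set(members)) for members in communities]


# Toy data (invented): two 5-node cliques joined by a single edge.
toy = nx.barbell_graph(5, 0)
toy_communities = greedy_modularity_communities(toy)
toy_anchor_counts = anchors_per_community(toy_communities, anchors=[0, 1, 7])
```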

Asynchronous Fluid Communities

Inspired by fluid dynamics, asynchronous fluid communities analysis initially asks for a target number of communities and starts from a random node. While maintaining a stable density, nodes will change how they are linked until all nodes stop changing communities. This analysis was almost as good as greedy modularity in isolating the RCT community. One community (#4) gathered the lion’s share of core author works, with better balance between communities.
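NetworkX exposes this method as asyn_fluidc, which takes the target number of communities as k. A minimal sketch follows, assuming an undirected view of the network. The NetworkX function requires a connected graph, so the example restricts itself to the largest connected component, and the seed value is arbitrary.

```python
import networkx as nx
from networkx.algorithms.community import asyn_fluidc

# Toy data (invented): two 5-node cliques joined by a single edge.
toy = nx.barbell_graph(5, 0)

# asyn_fluidc only accepts connected graphs, so keep the largest component.
largest = toy.subgraph(max(nx.connected_components(toy), key=len))

# k is the requested number of communities; the seed fixes the random start.
toy_fluid = [set(members) for members in asyn_fluidc(largest, k=2, seed=42)]
```

Because the starting nodes are random, different seeds can produce different memberships on a real network.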

Kernighan-Lin Bisection

The Kernighan-Lin algorithm, originally used to optimize the layout of digital circuits, was adapted for network science as a means to separate a network into two large components, iterating until the total weight of the edges crossing between them is minimized. It accomplishes this by swapping pairs of nodes between the two sides and adjusting their community membership until there is balance. Here, the bisection strategy was used to isolate components based on membership of Miller’s foundational work, discarding what remained and iterating until Miller is in a community all alone. The KLB score for this network, then, is the level a work survived before being discarded.

Because of this tactic, the resulting components may not make for optimal communities, but it is a good way to evaluate communities based on how many anchor works are members. The first separation identified 39,607 works as less relevant to RCT (i.e., all anchor works were in the other component). 18 bisections were made, with splits 4 and 8 representing the biggest RCT communities.
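The repeated bisection could be sketched as below, using NetworkX's kernighan_lin_bisection on an undirected graph. The function klb_levels is hypothetical, and recording the score as the round in which a work was discarded is one possible reading of the KLB score, not necessarily how the project computed it.

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection


def klb_levels(graph, focus, seed=0):
    """Return {node: level}, the bisection round in which each node was dropped."""
    levels = {}
    current = set(graph.nodes)
    level = 0
    while len(current) > 1:
        level += 1
        part = graph.subgraph(current).copy()
        side_a, side_b = kernighan_lin_bisection(part, seed=seed)
        # Keep the half that holds the focus work and discard the other half.
        keep, drop = (side_a, side_b) if focus in side_a else (side_b, side_a)
        for node in drop:
            levels[node] = level
        current = set(keep)
    levels[focus] = level + 1  # the focus work outlasts every round
    return levels


# Toy data (invented): two 4-node cliques joined by one edge, with node 0 as focus.
toy = nx.barbell_graph(4, 0)
toy_levels = klb_levels(toy, focus=0)
```

Each round keeps roughly half of the remaining nodes, so the number of rounds grows with the logarithm of the network size.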

Partition Evaluation

In addition to the subjective evaluation of the communities through visualization and metrics, there were statistical evaluations that were done on partitions with community classifications. Two metrics often used are coverage and performance:

A partition’s coverage is calculated as the ratio of intra-community edges to the total edges in the graph. In other words, it is the percentage of network edges that connect nodes from the same community. The higher this value, the better defined the boundaries between communities and the more inclusive the classifications become.
The performance of a partition looks at how well nodes stay within their communities. That is, performance not only looks at links between nodes of the same community but also at the number of possible edges to nodes in other communities that are missing. The sum of those two counts is divided by all of the potential edges in the graph to evaluate how well the classifications did.
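Both metrics are available together in NetworkX through partition_quality, which returns a (coverage, performance) pair. The small example below is a sketch on an invented toy graph, chosen so the values can be checked by hand.

```python
import networkx as nx
from networkx.algorithms.community import partition_quality

# Toy data (invented): two triangles joined by one edge, 7 edges in total.
toy = nx.barbell_graph(3, 0)
toy_partition = [{0, 1, 2}, {3, 4, 5}]

toy_coverage, toy_performance = partition_quality(toy, toy_partition)

# Coverage: 6 of the 7 edges stay inside a community.
# Performance: 6 intra-community edges plus 8 absent inter-community edges,
# divided by the 15 possible node pairs.
print(f"coverage={toy_coverage:.3f} performance={toy_performance:.3f}")
```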

When considering coverage and performance metrics, the Fluid Communities method appeared to be the best to identify communities.

Based on this evaluation, asynchronous fluid communities was chosen as the best distribution of community membership. This method also produced three communities that were good candidates for further pruning, given none of the surviving anchor works by core RCT authors were members. At this point, the network of interest now contained 41,637 nodes and 432,316 edges, with 5,406 contributing works and still preserving 51 works by core authors.

Distance was also used to further prune, looking specifically at the average distance to Miller’s foundational book. Any works for which that connection to Miller’s book could not be calculated (e.g., works that had no neighbors with distance calculations) were removed from the network. The network had now shrunk to 35,049 nodes and 343,419 edges, with 5,398 contributing works and just 36 works by core authors remaining.
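A simplified version of this distance check is sketched below. It uses plain shortest-path length, ignoring edge direction, as a stand-in for the average distance measure described above, so it is an approximation of the idea rather than the project's calculation. The function name and toy data are invented.

```python
import networkx as nx


def prune_unreachable(graph, source):
    """Keep only nodes that have a finite path length to the source work."""
    undirected = graph.to_undirected(as_view=True)
    distances = nx.single_source_shortest_path_length(undirected, source)
    # Nodes missing from `distances` have no path to the source at all.
    return graph.subgraph(distances).copy(), distances


# Toy data (invented): "x" and "y" cite each other but never reach "miller".
toy = nx.DiGraph([("b", "miller"), ("c", "b"), ("x", "y")])
toy_reachable, toy_distances = prune_unreachable(toy, "miller")
```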

Continued evaluation then led to a decision to focus on the one community (community 4) containing the bulk of the core author works. 32 of the remaining 36 core works were in that community, among 1,773 contributing works. Since the removal of works also removed connections to other communities, 15 additional works became isolates and were also removed from the network. This represented a significant reduction of 81.66 percent of the works in the network and 85.99 percent of all edges.
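Restricting the network to one community and then clearing out the stranded works can be sketched as follows. The membership dictionary and the function keep_community are hypothetical stand-ins for however the community labels were actually stored.

```python
import networkx as nx


def keep_community(graph, membership, community):
    """Keep one community's nodes, then drop nodes left with no edges."""
    keep = [node for node in graph if membership.get(node) == community]
    sub = graph.subgraph(keep).copy()
    isolates = list(nx.isolates(sub))  # nodes whose only links left the community
    sub.remove_nodes_from(isolates)
    return sub, isolates


# Toy data (invented): "c" is in community 4 but only links to community 1.
toy = nx.DiGraph([("a", "b"), ("c", "d")])
toy_membership = {"a": 4, "b": 4, "c": 4, "d": 1}
toy_focus, toy_isolates = keep_community(toy, toy_membership, community=4)
```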

Final Pruning

With a smaller network now just 0.5 percent of the initial network from data mining, the asynchronous fluid community analysis was performed a second time, identifying 7 sub-communities to use in the visual design. These were analyzed in combination with other network metrics of degree, authority, and clustering coefficient to understand some of the properties of each community. I also compared the classification to two derived metrics, the Kernighan-Lin bisection level and the number of appearances in an ego network.

In several metrics, a pattern quickly emerged that separated two of the communities (2 and 5) from the others. For both Authority and the appearances in the ego networks, the node sizes displayed in the network graph are markedly more pronounced in sub-communities 2 and 5. Similarly, they showed a marked difference in their bisection levels, and to a lesser extent their degree. As communities least connected to the core RCT authors, and Miller’s book in particular, both sub-communities were ignored for the visualization and five remained.

The removal of these two sub-communities resulted in a final network of 3,408 nodes and 19,227 edges. 1,327 contributing works remained, including 32 of the original 98 core works identified. In the end, the full citation network had been pruned to 0.26 percent of its works and 0.60 percent of edges.
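The closing percentages follow directly from the node and edge counts reported for the initial and final networks. A quick check of the arithmetic:

```python
# Counts reported for the initial mined network and the final pruned network.
initial_works, initial_edges = 1_289_418, 3_182_441
final_works, final_edges = 3_408, 19_227

works_share = 100 * final_works / initial_works
edges_share = 100 * final_edges / initial_edges

print(f"works retained: {works_share:.2f}%")  # works retained: 0.26%
print(f"edges retained: {edges_share:.2f}%")  # edges retained: 0.60%
```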

Select the data to include and explore possible shapes it can take.

Exploration

Publication data was mined from OpenAlex for all remaining contributing works, including the standardized topic concepts that were already scored for each work. The final network contained 1,565 concepts.

Psychology was the top concept for all five sub-communities, but the importance of other topics deviated from there. The next four topics (Domestic Violence, Medicine, Sociology, and HIV) offered pivot points that separated the nature of study for four of the five communities. Sub-community 4 was finally differentiated from the main RCT sub-community 3 by Qualitative Research, ranked 34th in the overall list. Other topics in the top 25 concepts included Clinical psychology, Social psychology, Gender studies, and Feminism, all known topics of this field of study.
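One way to produce a per-community concept ranking like this is sketched below. It assumes each work has been reduced to a record with a community label and a mapping of concept names to scores. That record layout, the function name, and the sample scores are all invented for illustration and do not reflect the OpenAlex response format or the project's data.

```python
from collections import defaultdict


def rank_concepts(works):
    """Sum concept scores per community and sort concepts from high to low."""
    totals = defaultdict(lambda: defaultdict(float))
    for work in works:
        for name, score in work["concepts"].items():
            totals[work["community"]][name] += score
    return {
        community: sorted(scores, key=scores.get, reverse=True)
        for community, scores in totals.items()
    }


# Toy records (invented scores) for two sub-communities.
toy_works = [
    {"community": 3, "concepts": {"Psychology": 0.9, "Feminism": 0.5}},
    {"community": 3, "concepts": {"Psychology": 0.8}},
    {"community": 4, "concepts": {"Psychology": 0.7, "Qualitative research": 0.6}},
]
toy_ranking = rank_concepts(toy_works)
```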

Inspiration

For visual inspiration, I gathered a number of examples of visual information design that had answered similar questions or presented information in a relevant manner. In particular, I wanted my visualization to:

communicate a chronological order to publications
show relative importance of works within the greater RCT community
identify communities of related work and topic areas of concern

This collection of existing visualizations provided some ideas for how to address my information criteria:

Network visualizations like the ones typical of Gephi can provide a high-level gestalt about the shape of data and show any large separations in natural clusters. However, it is more difficult to show chronology—a key attribute for the RCT data—or control the display outside of styling nodes and edges. Though limited, starting with a standard network display is a crucial part of the exploration process.

For chronology, I loved the simplicity of the spiral in the Progressive Era History and the fact that it put emphasis on a sequence rather than the gaps in between works. The Citeology citation map offered a clear view of a conference history, both by volume and influence. The most appealing of the three, though, was the use of images and description with a timeline to explain the breadth of Pablo Picasso’s art.

For topics, the MLA format visualization was appealing for its tiered concepts, allowing someone to start broad and dive into detail. Likewise, the ribbon visualization for the Plastic Waste Pollution suggests some possibility to show how communities individually evolved over time, as well as a way to guide the view to detail about each community. I also included the Washington Metro subway map as a reminder that comprehension should be a guiding principle, and sometimes abstractions can leverage tacit understanding to communicate effectively.

Explain the nature of the network in a visual presentation of found insights.

Final Visualization

With all of these ideas in mind, I proceeded to sketch visualization concepts for RCT works. The sketching process was a way to work through ideas or problems I was seeing in the data, letting the process adapt as new information came in or constraints loomed. Depth of content was an early sacrifice, as hopes of bringing in author information or evaluating concepts through analysis of abstracts had to be set aside.

Once it moved from sketch to code, though, I only had time to explore two concepts: a chronological scatterplot, and radial chronology.

Chronology Scatterplot

The initial attempt to shape my two desired attributes, publication date and distance factor, into a simple scatterplot was successful. A low-resolution proof put both values in expected places in a box, allowing me some confidence to start to style and explore the data through iteration. The scatterplot was styled to leverage shapes and colors to represent different communities, highlighting the works by core authors in purple.

Occlusion, however, was an issue when incorporating the network distance from Miller’s book. Also, it became clear that OpenAlex was missing some publication dates, and some of the dates they had were all falling in clusters at the first of the year or month, suggesting some data integrity issues.
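A quick diagnostic for this kind of date clustering is sketched below. It tallies missing dates and dates that land on January 1 or on the first of any month. The function name and sample dates are invented, and the check is only a rough signal, since some works genuinely are published on those days.

```python
from datetime import date


def date_clustering(dates):
    """Summarize missing dates and dates that land on the first of a period."""
    known = [d for d in dates if d is not None]
    return {
        "missing": len(dates) - len(known),
        "january_first": sum(1 for d in known if (d.month, d.day) == (1, 1)),
        "first_of_month": sum(1 for d in known if d.day == 1),
    }


# Toy data (invented): two January 1 dates, one June 1 date, one missing date.
sample_dates = [date(1991, 1, 1), date(1991, 1, 1), date(2004, 6, 1), date(2010, 3, 17), None]
date_report = date_clustering(sample_dates)
```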

One insight that arose was an understanding that the distance factor (how far a work was from the core work) was insufficient to prevent occlusion. This visualization used a larger version of the final network, showing some need for further pruning as well as revealing that time, distance, and community membership are not clearly related. Most importantly, the scatterplot was not telling an interesting story.

Radial Chronology

Taking a cue from Jason Portenoy and Jevin D. West’s “Progressive era American Populism”, the RCT data was presented as a spiral. All of the publication dates were sorted chronologically, starting with the foundational work in the center. The chronological gap between a work and Miller’s book spread the citations further away. With communities color-coded, the result is a colorful spiral to represent the breadth of work across all five sub-communities.

There was some promise here. In several places, a string of works from the same community created patterns that suggested a sudden burst of academic activity. This happened across all of the communities at some point. There were also larger shapes closer to center, which indicates that those works have more citations and references. The spiral seemed good at highlighting sequential patterns, but it did so without the context of time.

The spiral was replaced with a radial timeline, which put 1975 at its center, New Year’s Day at the top, and then altered the angle and radius to match the year, month and day of publication. The occlusion that remained in the clusters of same-day publications was mitigated by randomly jittering works from the calculated center, so they could share space. To help identify the nameless works on the chronology map, several works were annotated in the margins by drawing horizontal lines, highlighting the anchor works on one side and the most connected other works on the other.
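The mapping from a publication date to a position on the radial timeline might be sketched like this. Several details are assumptions made for the example: the angle runs clockwise, the radius grows continuously through the year, one year equals one unit of radius, and the jitter is a uniform offset. The function name radial_position is also invented.

```python
import math
import random
from datetime import date


def radial_position(published, origin_year=1975, jitter=0.0, rng=None):
    """Map a date to (x, y): radius grows with years, angle with day of year."""
    year_start = date(published.year, 1, 1)
    year_length = (date(published.year + 1, 1, 1) - year_start).days
    fraction = (published - year_start).days / year_length
    radius = (published.year - origin_year) + fraction
    # January 1 sits at the top (12 o'clock); later dates move clockwise.
    angle = math.pi / 2 - 2 * math.pi * fraction
    x, y = radius * math.cos(angle), radius * math.sin(angle)
    if jitter:
        rng = rng or random
        # Nudge same-day works apart so they can share space.
        x += rng.uniform(-jitter, jitter)
        y += rng.uniform(-jitter, jitter)
    return x, y


center = radial_position(date(1975, 1, 1))
one_year_out = radial_position(date(1976, 1, 1))
nudged = radial_position(date(1976, 1, 1), jitter=0.05, rng=random.Random(7))
```

Passing a seeded random.Random instance keeps the jitter reproducible between renders.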

Final Visualization

Each of the five communities is identified by color, with the size of the shape indicating greater degree (citations and references) in relation to other works. Annotations on either side indicate where the core author works and most influential works are located, showing their title and year of publication. Finally, the relationship between the five communities is shown in an inset connecting their shared and divergent key topics.