Resources & Assets
The Cline Center draws from distinctive resources and assets in supporting its research, education, and collaboration opportunities.
The Cline Center’s research draws on two types of data resources: extreme-scale collections of unstructured textual data created by news content providers around the world, and structured datasets created by the Cline Center that support a wide range of social science research on human flourishing and societal well-being. The following illustrates just a few of the data assets curated or supported by the Cline Center.
The Cline Center’s text analytics research draws on over 100 million historical news reports from around the world, including:
- Over 70 million unduplicated news stories collected with the Cline Center’s Voyager web crawl system from thousands of global news websites from 2006 to the present, with new stories added daily;
- The entire population of newspaper stories published between 1945 and 2005 by the New York Times and Wall Street Journal;
- Open-source information encompassing over 4 million news stories from BBC Monitoring’s Summary of World Broadcasts (SWB) from 1979 to the present and over 3 million news stories captured by the U.S. Government’s Foreign Broadcast Information Service (FBIS) from 1994 to 2003. News data from these sources consist of stories from every country in the world that have been translated into English by fluent speakers who are culturally resonant with the countries in which news items originally appeared.
- Another 6.2 million scanned and digitized microfilm and microfiche records of SWB and FBIS content captured from the 1940s to the 1990s.
- The only digitized record in the world of story summaries for four of the five main American newsreels that formed the hub of a worldwide newsreel system serving a global audience with visual news items nearly seven decades before the advent of CNN. The Cline Center also holds story summaries for one of the leading British newsreel companies along with a unique set of early television news summaries. Together, these various sources include around 130,000 stories broadcast between 1915 and 1985.
The Cline Center curates meta-data and extracted features for all of the items in the Global News Archive, including the following information:
- Lists of organizations, places, and people mentioned
- Geo-parsed locations mentioned with associated latitude/longitude coordinates
- Several measures of sentiment
All of this information is stored in an Apache Solr index, which allows users to query and subset material for further analysis.
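As a rough illustration of what querying and subsetting a Solr index can look like, the sketch below builds a request URL for Solr's standard `/select` handler using the `q`, `fq`, `rows`, and `wt` parameters. The base URL and the field names (`text`, `country`, `publication_date`) are hypothetical placeholders and are not taken from the Cline Center's actual schema.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, country=None, start=None, end=None, rows=10):
    """Return a Solr /select URL for a keyword search with optional filters.

    All field names used here are assumptions made for illustration only.
    """
    params = [
        ("q", f"text:({text})"),  # main keyword query on a hypothetical full-text field
        ("rows", str(rows)),      # maximum number of documents to return
        ("wt", "json"),           # ask Solr for a JSON response
    ]
    if country:
        # Filter queries (fq) narrow the result set without affecting scoring.
        params.append(("fq", f'country:"{country}"'))
    if start and end:
        # Solr range syntax: field:[lower TO upper], with ISO 8601 timestamps.
        params.append(("fq", f"publication_date:[{start} TO {end}]"))
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(
        build_solr_query(
            "http://localhost:8983/solr/news",  # hypothetical Solr core
            "protest",
            country="Egypt",
            start="2011-01-01T00:00:00Z",
            end="2011-12-31T23:59:59Z",
            rows=5,
        )
    )
```

The function only constructs the URL and makes no network call, so the resulting work set definition can be inspected or logged before any request is sent.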
The Cline Center Historical Phoenix Event Data cover the period 1945-2015 and include several million CAMEO-labeled events extracted from approximately 14 million news stories. These data were produced using state-of-the-art PETRARCH-2 software to analyze content from the New York Times (1945-2005), BBC Monitoring's Summary of World Broadcasts (1979-2015), and the CIA’s Foreign Broadcast Information Service (1995-2004). These data document the agents, locations, and issues at stake in a wide variety of conflict, cooperation, and communicative events using the CAMEO ontology framework. The links below provide access to the event datasets, as well as meta-data for the news sources used to produce the events.
The Social, Political, and Economic Event Database (SPEED) project is a technology-intensive effort to extract event data from a global archive of news reports covering the period from 1945 to the present. SPEED is a monitoring system for detecting small-scale conflict process events using artificial intelligence algorithms and hybrid systems that immerse highly trained human analysts in customized software environments. SPEED's "human in the loop" technology collects data on a full range of civil unrest, political violence, and state repression events at scales approaching those of fully automated systems, but with informational richness and data quality far superior to what fully automated systems provide.
The Cline Center's Rule of Law project compiles historical data on legal periodicals (1773-present) and legal education programs (1100-present) to assess country-level legal infrastructures and measure the extent to which nations have institutionalized law-based social orders.
The Comparative Constitutions Project uses novel information technologies to collect, organize, and cross-validate comprehensive data about the world's constitutions. Currently a stand-alone organization, the Comparative Constitutions Project was originally incubated at the Cline Center and continues to be supported by the Cline Center.
The Coup D’état Project documents attempted and accomplished coups in all countries with more than 500,000 people between 1945 and 2005. The project has thus far identified over 1,000 coup events, which to our knowledge makes it the most complete listing of coups, attempted coups, and coup conspiracies available anywhere in the world.
The Composition of Religious and Ethnic Groups (CREG) Project contains trend data from 1945 to 2013 on population percentages for major religious and ethnic groups in all countries with more than 500,000 people. This is the most systematic and wide-ranging inventory of socio-cultural groups yet collected.
Realizing the potential of these data assets for our campus community and beyond requires developing analytics tools and easily accessible points of entry for data science, social science, and humanities researchers to make innovative discoveries. The Cline Center aims to lower the opportunity costs facing social science and humanities researchers interested in text analytics by developing software resources that place the power of text-mining analytics into the hands of researchers without requiring them to have advanced degrees in computer science.
All of the Cline Center’s extracted feature datasets have been designed for easy access by R users, and most other data resources have been structured with R integration in mind.
The Cline Center offers a variety of full-text searching capabilities so that researchers can quickly identify and refine work sets within the center’s Global News Archive.
The Cline Center’s Scout software system offers seamless integration with the Global News Archive for supporting human analysis tasks and algorithm development. Developed in collaboration with the National Center for Supercomputing Applications, Scout allows analysts to query, annotate, and visualize news content in a secure but flexible cyberenvironment that also stores annotations as structured datasets, assigns tasks to individual analysts, tests for inter-analyst reliability, and manages the activities of multiple analysts simultaneously.
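The text does not say which statistic Scout uses to test inter-analyst reliability. As one common possibility, the sketch below computes Cohen's kappa, which compares the observed agreement between two analysts with the agreement expected by chance. The function name and the example labels are illustrative assumptions and are not part of Scout.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two analysts who labeled the same items.

    This is an illustrative stand-in; Scout's actual reliability test
    is not described in the text.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both analysts must label the same non-empty set of items.")
    n = len(labels_a)
    # Share of items on which the two analysts chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each analyst's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both analysts used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    analyst_1 = ["protest", "riot", "protest", "strike"]
    analyst_2 = ["protest", "riot", "strike", "strike"]
    print(round(cohens_kappa(analyst_1, analyst_2), 3))
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance.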
Voyager is an extensible data ingestion pipeline for the Scout system that is fed by a near real-time RSS web crawler. As soon as new RSS feed elements are pushed out by monitored content providers around the world, Voyager immediately processes each news story using Natural Language Processing algorithms, geocodes all place names, and computes relevance scores using machine learning algorithms before storing the news text and associated features for later analysis.
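The stages described above (read RSS items, extract features, geocode place names, score relevance, store the result) can be sketched in miniature as follows. This is a toy under stated assumptions and not Voyager's implementation: the gazetteer lookup stands in for a real geocoder, the keyword ratio stands in for a trained machine learning relevance model, and every name in the code is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical gazetteer standing in for a real geocoder: name -> (lat, lon).
GAZETTEER = {"Cairo": (30.0444, 31.2357), "Nairobi": (-1.2921, 36.8219)}
# Hypothetical keywords standing in for a trained relevance model.
KEYWORDS = {"protest", "strike", "riot"}


def parse_rss_items(rss_xml):
    """Yield one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        yield {
            "title": item.findtext("title", default=""),
            "text": item.findtext("description", default=""),
        }


def geocode(text):
    """Return coordinates for every gazetteer place name found in the text."""
    return {name: coords for name, coords in GAZETTEER.items() if name in text}


def relevance(text):
    """Share of tokens that are keywords; a crude proxy for a relevance score."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    return sum(t in KEYWORDS for t in tokens) / len(tokens)


def process_feed(rss_xml):
    """Run each story through the geocoding and scoring stages."""
    records = []
    for story in parse_rss_items(rss_xml):
        body = f"{story['title']} {story['text']}"
        story["places"] = geocode(body)
        story["relevance"] = relevance(body)
        records.append(story)  # a real pipeline would write to storage here
    return records


SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Protest in Cairo</title>
<description>Workers strike downtown.</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for record in process_feed(SAMPLE_FEED):
        print(record)
```

Keeping each stage as a separate function mirrors the idea of an extensible pipeline: a stage can be swapped for a stronger component without changing the others.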
Supporting the varied needs of Cline Center research projects is a combination of dedicated computational assets and additional computational resources available to the University of Illinois community. Ranging from data science workstations to dedicated physical servers and access to some of the world’s most powerful supercomputers, the Cline Center is well positioned to support the computational needs of extreme-scale text analytics research at the leading edge of innovation and discovery.
The Cline Center maintains a set of workstations on premises that are specifically configured to support text analytics applications and to be easily integrated with the center’s data assets and cyberinfrastructure.
The Cline Center operates a dedicated collection of physical servers configured to support the specific computational and cybersecurity demands of its extreme-scale text analytics research projects.
The Cline Center maintains a set of virtual servers to support a wide range of applications for shorter- or longer-duration projects.
The Cline Center has the capability to support remote access to its computational resources by authorized collaborators operating within a secure environment.