Public data sets that NetKit has been used on

These are the benchmark data sets that were used in:
Sofus A. Macskassy and Foster Provost, "Classification in Networked Data: A toolkit and a univariate case study," Journal of Machine Learning Research, 8(May):935-983, 2007. [pdf].

There are four data sets: CoRA, IMDb, Industry, WebKB.
References are at the end.

CoRA

This data set is based on the cora data set (McCallum et al., 2000), which comprises computer science research papers. It includes the full citation graph as well as labels for the topic of each paper (and potentially sub- and sub-subtopics). There are seven possible labels.

The file contains two data sets: one uses only citation links, and the other uses both citation and shared-author links. Edge weights are additive: each shared author adds one to the weight of the edge between two papers, and each citation adds one (so two papers that cite each other contribute two).
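The additive weighting described above can be sketched in Python as follows. This is a minimal illustration, not the code used to build cora.zip: the function name `cora_edge_weights`, the input shapes (a list of citation pairs and a paper-to-authors mapping), and the treatment of edges as undirected are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cora_edge_weights(citations, authors):
    """Return undirected edge weights between papers.

    citations: iterable of (citing_paper, cited_paper) pairs (hypothetical input shape).
    authors:   dict mapping paper id -> set of author names (hypothetical input shape).
    """
    weights = Counter()

    # Each distinct citation adds 1; a mutual citation therefore adds 2.
    for citing, cited in set(citations):
        if citing != cited:
            weights[frozenset((citing, cited))] += 1

    # Each author shared by two papers adds 1 to the edge between them.
    for a, b in combinations(sorted(authors), 2):
        shared = len(authors[a] & authors[b])
        if shared:
            weights[frozenset((a, b))] += shared

    return weights


example = cora_edge_weights(
    citations=[("p1", "p2"), ("p2", "p1"), ("p1", "p3")],
    authors={"p1": {"A", "B"}, "p2": {"A", "B"}, "p3": {"C"}},
)
print(example[frozenset(("p1", "p2"))])  # 2 citations + 2 shared authors
```

Passing an empty `authors` mapping yields the citation-only variant of the graph under the same assumptions.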

NetKit data file: cora.zip (contains only labels + edges).
Original data: cora-classify.tar.gz, available from Andrew McCallum's Code+Data Page via the "Cora research Paper Classification" link.

IMDb

This data set is based on data obtained from the Internet Movie Database. It was built to re-create an earlier study that modeled movie success as determined by box-office receipts (Jensen and Neville, 2002). It contains movies released in the United States between 1996 and 2001, with class labels identifying whether the opening-weekend box-office receipts exceeded $2 million (Neville et al., 2003).

The file contains two types of graphs:

NetKit data file: imdb.zip (contains only labels + edges).

Industry

These two data sets contain companies that are linked via co-occurrence in text documents. They are derived from two different document collections, which differ in source, in the distribution of documents, and in time period (and therefore in topic distribution).

NetKit data file: industry.zip (contains only labels + edges).

WebKB

This data is based on the WebKB Project (Craven et al., 1998). It consists of sets of web pages from four computer science departments, with each page manually labeled with one of seven categories: course, department, faculty, project, staff, student, or other. We do not include the 'other' pages in the graph, but we do use them to generate edges.

This data file contains eight different graphs (two per university). For each university, there is one graph using direct hyperlinks and another using co-citation links (if x links to z and y links to z, then x and y co-cite z) (Chakrabarti et al., 1998; Lu and Getoor, 2003). When creating co-citation edges, we allow an 'other' page to serve as the intermediary z, although the final graph does not include the 'other' pages. To weight the link between x and y, we count the hyperlinks from x to z and, separately, the hyperlinks from y to z, and multiply these two counts.
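The co-citation weighting can be sketched as follows. This is an illustrative reading rather than the procedure used to produce webkb.zip: the function name `cocitation_weights` and its inputs are hypothetical, and summing the products when x and y share more than one target z is an assumption, since the description above only covers a single z.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cocitation_weights(hyperlinks, other_pages=frozenset()):
    """Return co-citation edge weights keyed by (x, y) with x < y.

    hyperlinks:  iterable of (source, target) pairs; repeated pairs are
                 repeated hyperlinks (hypothetical input shape).
    other_pages: pages labeled 'other'; they may act as the shared target z
                 but never appear as an endpoint of an edge.
    """
    link_counts = Counter(hyperlinks)

    # For every target z, record how many times each non-'other' page links to it.
    inbound = defaultdict(dict)
    for (src, dst), count in link_counts.items():
        if src != dst and src not in other_pages:
            inbound[dst][src] = count

    weights = Counter()
    for sources in inbound.values():
        for x, y in combinations(sorted(sources), 2):
            # Product of the two link counts for this z.
            # Assumption: contributions from different z are summed.
            weights[(x, y)] += sources[x] * sources[y]
    return weights
```

For instance, if page s1 has 2 hyperlinks to a page g and page s2 has 3 hyperlinks to g, the edge between s1 and s2 receives a contribution of 2 × 3 = 6 from g.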

NetKit data file: webkb.zip (contains only labels + edges).
Original data: ilp-data.tar, available from the WebKB project's ILP '98 Paper data page.

References