Public data sets that NetKit has been used on
These are the benchmark data sets used in: Sofus A. Macskassy and Foster Provost, "Classification in Networked Data: A Toolkit and a Univariate Case Study," Journal of Machine Learning Research, 8(May):935-983, 2007. [pdf].
There are four data sets: CoRA, IMDb, Industry, WebKB.
References are at the end.
CoRA
This data set is based on the cora data set (McCallum et al., 2000), which comprises computer science research papers. It includes the full citation graph as well as labels for the topic of each paper (and potentially sub- and sub-subtopics). There are seven possible labels.
The file contains two data sets, one using only citation links and one using both citation and shared-author links. Edge weights are additive: each shared author contributes one, and each citation contributes one (two if the papers cite each other).
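As an illustration of the additive weighting described above, here is a minimal sketch. The function name `cora_edge_weights` and the input shapes (a list of citation pairs and a dict of author sets) are assumptions made for this example; they are not NetKit's file format or API.

```python
from collections import Counter
from itertools import combinations


def cora_edge_weights(citations, authors):
    """Hypothetical helper: build undirected, additive edge weights.

    citations: iterable of (citing_paper, cited_paper) pairs.
    authors:   dict mapping paper -> set of author names.
    """
    weights = Counter()
    # One unit per distinct citation direction, so mutual citations add two.
    for citing, cited in set(citations):
        if citing != cited:
            weights[frozenset((citing, cited))] += 1
    # One unit per author the two papers have in common.
    for a, b in combinations(sorted(authors), 2):
        shared = len(authors[a] & authors[b])
        if shared:
            weights[frozenset((a, b))] += shared
    return weights


example_citations = [("p1", "p2"), ("p2", "p1"), ("p1", "p3")]
example_authors = {"p1": {"A", "B"}, "p2": {"A", "B"}, "p3": {"C"}}
example_weights = cora_edge_weights(example_citations, example_authors)
```

Passing an empty `authors` dict would give the citation-only variant of the graph under the same assumptions.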
NetKit data file: cora.zip (contains only labels + edges).
Original data: cora-classify.tar.gz, available from Andrew McCallum's Code+Data Page via the "Cora research Paper Classification" link.
IMDb
This data set is based on data obtained from the Internet Movie Database. It was built to re-create an earlier study that built models predicting movie success as determined by box-office receipts (Jensen and Neville, 2002). The data contains movies released in the United States between 1996 and 2001, with class labels identifying whether the opening weekend box-office receipts will exceed $2 million (Neville et al., 2003).
The file contains two types of graphs:
- One, linking movies if they share a production company (based on observations from previous work (Macskassy and Provost, 2003) as well as input from David Jensen). The weight of an edge in the resulting graph is the number of production companies two movies have in common.
- Two, linking movies if they share a production company, producer, director, or actor. The weight of an edge is the number of such entities two movies have in common.
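The two graph types above differ only in which kinds of shared entities count toward an edge. The sketch below shows one way to express that; the function name `shared_entity_weights`, the `(kind, name)` tuples, and the kind labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def shared_entity_weights(movie_entities, kinds):
    """Hypothetical helper: weight = number of shared entities of the given kinds.

    movie_entities: dict mapping movie -> set of (kind, name) tuples.
    kinds:          entity kinds that are allowed to create links.
    """
    kinds = set(kinds)
    weights = {}
    for a, b in combinations(sorted(movie_entities), 2):
        shared = {e for e in movie_entities[a] & movie_entities[b] if e[0] in kinds}
        if shared:
            weights[(a, b)] = len(shared)
    return weights


example_movies = {
    "m1": {("company", "X"), ("actor", "Y")},
    "m2": {("company", "X"), ("actor", "Y"), ("director", "Z")},
    "m3": {("actor", "Y")},
}
# Graph one: production companies only.
company_graph = shared_entity_weights(example_movies, ["company"])
# Graph two: production company, producer, director, or actor.
all_graph = shared_entity_weights(
    example_movies, ["company", "producer", "director", "actor"]
)
```

The pairwise loop is quadratic in the number of movies, which is fine for a sketch; an inverted index from entity to movies would scale better.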
NetKit data file: imdb.zip (contains only labels + edges).
Industry
These two data sets contain companies that are linked via cooccurrence in text documents. They are derived from two different corpora, representing different sources and distributions of documents and different time periods (which correspond to different topic distributions).
- industry-yh: As part of a study of activity monitoring, Fawcett and Provost (1999) collected 22170 business news stories from the web between 4/1/1999 and 8/4/1999. An edge was created between two companies if they appeared together in a story. The weight of an edge is the number of such cooccurrences found in the complete corpus. The resulting network comprises 1798 companies that cooccurred with at least one other company. The labels of the companies are based on Yahoo!'s 12 industry sectors.
- industry-pr: This data set is based on 35318 PR Newswire press releases gathered from April 1, 2003 through September 30, 2003. The companies mentioned in each press release were extracted and an edge was placed between two companies if they appeared together in a press release. The weight of an edge is the number of such cooccurrences found in the complete corpus. The resulting network comprises 2189 companies that cooccurred with at least one other company. The labels of the companies are based on Yahoo!'s 12 industry sectors.
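Both industry graphs use the same construction: count, over the whole corpus, the documents in which two companies appear together. A minimal sketch follows. The name `cooccurrence_weights` and the input shape are assumptions, as is the choice to count a pair at most once per document even if a company is mentioned several times in it.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(documents):
    """Hypothetical helper: count documents in which each company pair cooccurs.

    documents: iterable of collections of company names, one per story
               or press release.
    """
    weights = Counter()
    for companies in documents:
        # Assumption: repeated mentions within one document count once.
        for a, b in combinations(sorted(set(companies)), 2):
            weights[(a, b)] += 1
    return weights


example_docs = [["IBM", "Intel"], ["IBM", "Intel", "Dell", "IBM"], ["Apple"]]
example_edges = cooccurrence_weights(example_docs)
# Only companies that cooccur with at least one other company become nodes.
example_nodes = {name for pair in example_edges for name in pair}
```

In this example "Apple" never cooccurs with another company, so it is left out of the node set, mirroring the "cooccurred with at least one other company" condition above.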
NetKit data file: industry.zip (contains only labels + edges).
WebKB
This data is based on the WebKB Project (Craven et al., 1998). It consists of sets of web pages from four computer science departments, with each page manually labeled with one of seven categories: course, department, faculty, project, staff, student, or other. We do not include the 'other' pages in the graph, but use them to generate edges.
This data file contains eight different graphs (two per university). For each university, there is one graph using direct hyperlinks and another using co-citation links (if x links to z and y links to z, then x and y are co-citing z) (Chakrabarti et al., 1998; Lu and Getoor, 2003). When creating co-citation edges, we allow an 'other' page as an intermediary, although the final graph does not include the 'other' pages. To weight the link between x and y, we sum the number of hyperlinks from x to z and separately the number from y to z, and multiply these two quantities.
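The co-citation weighting can be sketched as follows. This is one reading of the description above, not a confirmed reproduction of how the files were generated: for each intermediary z, the number of hyperlinks from x to z is multiplied by the number from y to z, and the contributions of different intermediaries are summed (the summing over z is an assumption). The function name `cocitation_weights` and the `excluded` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cocitation_weights(hyperlinks, excluded=frozenset()):
    """Hypothetical helper: co-citation weights from raw hyperlinks.

    hyperlinks: iterable of (source, target) pairs, one per hyperlink,
                so repeated pairs mean multiple links.
    excluded:   pages (e.g. 'other' pages) that may act as the intermediary z
                but never appear as endpoints in the resulting graph.
    """
    link_counts = Counter((s, t) for s, t in hyperlinks if s != t)
    inbound = defaultdict(dict)  # z -> {source page: number of links to z}
    for (source, target), n in link_counts.items():
        inbound[target][source] = n
    weights = Counter()
    for z, sources in inbound.items():
        kept = sorted(s for s in sources if s not in excluded)
        for x, y in combinations(kept, 2):
            # Product of the two link counts; summed over z (assumption).
            weights[(x, y)] += sources[x] * sources[y]
    return weights


example_links = [
    ("x", "z"), ("x", "z"), ("y", "z"),  # x links to z twice, y once
    ("x", "o"), ("y", "o"),              # both also link to an 'other' page
    ("o", "x"),                          # links from 'other' pages create no endpoints
]
example_cocitations = cocitation_weights(example_links, excluded={"o"})
```

Here x and y co-cite z with contribution 2 * 1 and the 'other' page o with contribution 1 * 1, giving a total weight of 3 under the stated assumptions.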
NetKit data file: webkb.zip (contains only labels + edges).
Original data: ilp-data.tar, available from the WebKB project's ILP '98 Paper data page.
References
- S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 307-319, 1998.
- M. Craven, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and C. Y. Quek. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pages 509-516, 1998.
- T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62, 1999.
- D. Jensen and J. Neville. Data mining in social networks. In National Academy of Sciences Symposium on Dynamic Social Network Modeling and Analysis, 2002.
- Q. Lu and L. Getoor. Link-based classification. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 496-503, 2003.
- S. A. Macskassy and F. Provost. A simple relational classifier. In Proceedings of the Multi-Relational Data Mining Workshop (MRDM) at the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
- A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127-163, 2000.
- J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning relational probability trees. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 625-630, 2003.