Data For Research

About Data for Research (DfR):

Data for Research is a free service for researchers who wish to analyze content on JSTOR through a variety of lenses and perspectives. DfR enables researchers to find useful patterns, associations, and unforeseen relationships in the body of research available in the journal and pamphlet archives on JSTOR. To this end, we provide researchers with document data sets containing OCR text, metadata, key terms, N-grams, and reference text.

JSTOR incorporates metadata and OCR text from many different sources. We work to mitigate any inconsistencies, but you may come across issues with datasets, such as duplicate OCR, missing OCR, or incomplete metadata. Please contact us if you encounter data issues.

Using the site:

Through a faceted interface, researchers can explore the entire JSTOR archive, quickly and easily defining content of interest through an iterative process of searching and filtering results. By creating a DfR account, you can download the metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents. The interface uses Boolean logic; if you find your query especially difficult to build, we are happy to help.

How to get larger data sets:

If you require more than 1,000 documents, or a type of data not available through the interactive portion of the site, please contact us. We will provide you with a questionnaire in which you can describe more fully the nature of your work, its participants, and the type and amount of data requested. If we are able to approve your request for the data as described, we will present you with our Data for Research agreement for your review and signature.

In general, we can provide the following:

  • Per-Document N-Grams:

    An N-gram is a contiguous sequence of n tokens from a given sequence of text. The items are tokens separated by spaces. We can provide 1-, 2-, 3-, and 4-grams.
  • OCR/Full Text:

    Full text of documents. Depending on the journal and issue, this may be either text extracted from page scans via OCR or the full text as submitted digitally by publishers. We do not distinguish between OCR and digitally sourced full text in DfR datasets.
  • Key Terms:

    Per-document key words automatically extracted from documents using TF*IDF term weighting. Please note this does not include author or publisher-assigned key terms.
  • Reference text:

    Extracted snippets of text for each reference in a document. These may be sourced from OCR or digital metadata, depending on whether the document comes from page scans or from digital media. If JSTOR knows that a reference points to another JSTOR article, the DOI of the referenced article is included with the reference. Most references do not have such links.
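
To make the N-gram definition above concrete, here is a minimal Python sketch that builds 1- through 4-grams from space-separated tokens and counts them per document. The function names, the plain whitespace tokenization, and the sample sentence are assumptions made for illustration; the page does not describe how DfR tokenizes, normalizes case, or formats the N-gram files it delivers.

```python
from collections import Counter


def ngrams(text, n):
    """Return every contiguous run of n whitespace-separated tokens in text."""
    tokens = text.split()
    # A text with fewer than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=4):
    """Count the 1- through max_n-grams of one document, keyed by n."""
    return {n: Counter(ngrams(text, n)) for n in range(1, max_n + 1)}


# Hypothetical sample text, not taken from any JSTOR document.
sample = "the body of research in the body of work"
counts = ngram_counts(sample)

for n in range(1, 5):
    print(n, counts[n].most_common(2))
```

Counting per document, rather than across the whole corpus, mirrors the "Per-Document" wording of the list item; a corpus-wide tally could be obtained by summing the per-document counters.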

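As an illustration of the TF*IDF weighting mentioned under Key Terms, the sketch below scores terms in a toy three-document corpus and picks the highest-scoring terms for each document. It uses one common variant (term count divided by document length, multiplied by the natural log of the number of documents over the document frequency). The exact formula, tokenization, stop-word handling, and corpus that JSTOR uses are not stated on this page, so every name and document here is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term of each document with a simple TF*IDF variant.

    tf  = occurrences of the term / number of tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


def key_terms(doc_scores, k=3):
    """Return the k highest-scoring terms of one document."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]


# Hypothetical toy corpus, not JSTOR content.
docs = [
    "glacier ice core records climate",
    "climate policy and trade policy",
    "medieval trade routes and climate",
]
scores = tf_idf(docs)
for doc_scores in scores:
    print(key_terms(doc_scores, k=2))
```

A term that appears in every document, such as "climate" here, receives a score of zero, which is why TF*IDF tends to surface terms that are distinctive for a document rather than merely frequent.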

JSTOR is a digital library of academic journals, books, and primary sources. JSTOR helps people discover, use, and build upon a wide range of scholarly knowledge through a powerful research and teaching platform, and preserves this content for future generations. JSTOR is part of ITHAKA, a not-for-profit organization that also includes Ithaka S+R and Portico.

Data for Research has been developed by JSTOR's Advanced Technology (AT) group. The AT group is dedicated to discovering and using relevant technologies in support of JSTOR and the broader scholarly community.

JSTOR is part of ITHAKA, a not-for-profit organization helping the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
©2000-2010 ITHAKA. All Rights Reserved. JSTOR®, the JSTOR logo, and ITHAKA® are registered trademarks of ITHAKA.