- How do I contact you?
- Why do I need to register?
- What are those hashtags in my dataset?
- Can I view other people's queries and results?
- Does the search count include both metadata and text?
- How do I know when my request is completed?
- How do I search for articles?
- I created a query, but when I clicked submit it asked me to log in. Why is that?
- I need to make a small change to one of my queries. Do I really need to start over from scratch?
- Is the searching multilingual?
- I have downloaded the result set, now what do I do with it?
- What are OCR errors and do I need to care about them?
- What is a CSV file?
- Why are the results limited, and how can I get the rest?
- How does DFR tokenize text for n-grams?
Questions and feedback related to the Data for Research site may be submitted from the Contact Us page. We are very interested in feedback as we look to improve existing functionality and add new features in the coming months. <back>
From a practical standpoint we need to know where to send notifications about generated datasets. A dataset is specified interactively via the DfR 'Explore' tool, but its creation is performed offline on one of our back-end servers. Generating the dataset bundle can take anywhere from a minute or two to several hours, depending on the size of the dataset and the number of jobs queued for the server. Once your dataset has been generated, an email notification is sent and the dataset is made available for download from the DfR site. <back>
The hashtags are used as replacement characters for non-word tokens in n-grams:
| Hashtag | Meaning |
|---------|---------|
| # | Indicates the beginning or end of a sentence |
| ## | Indicates a number |
| ### | Indicates punctuation (any non-alphanumeric character) |
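As a rough illustration of this substitution, the sketch below maps tokens to their hashtag placeholders. The `hashtag_token` function and the `<s>`/`</s>` boundary markers are our own illustrative choices; the exact rules DfR applies may differ slightly.

```python
import re

def hashtag_token(token):
    """Map a non-word token to its DfR-style hashtag placeholder.

    Illustrative sketch only; DfR's actual substitution rules may differ.
    """
    if token in ("<s>", "</s>"):             # sentence boundary marker
        return "#"
    if re.fullmatch(r"\d+([.,]\d+)*", token):
        return "##"                           # a number
    if re.fullmatch(r"\W+", token):
        return "###"                          # punctuation
    return token                              # an ordinary word

tokens = ["<s>", "In", "1848", ",", "revolutions", "swept", "Europe", ".", "</s>"]
print([hashtag_token(t) for t in tokens])
# → ['#', 'In', '##', '###', 'revolutions', 'swept', 'Europe', '###', '#']
```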
No. All queries are 'private' and are visible only to their owner and site administrators. <back>
The counts returned by the queries represent the number of journal articles selected using the criteria specified in the 'Explore' tool. Depending on the criteria defined, this can include a very large number of articles. A dataset will consist of a sample of the articles selected by the query if the count exceeds the basic dataset size (currently defined as 1,000 articles). Larger samples are available upon request. Details regarding the structure of the files and their contents are provided in a README file contained in the dataset. <back>
An email notification will be sent to the address specified during registration. The status of the entry in your 'My Data Requests' page will also be updated. Depending on the size of the requested dataset and the load on our back-end servers, a request can take anywhere from a couple of minutes to a few hours to complete. <back>
The 'Explore' tool provides a means to select articles using a combination of full-text searching and faceted filtering on selected metadata fields. Full-text searching supports both terms and phrases. The Explore interface permits the iterative selection and refinement of the query criteria, with feedback provided by graphs depicting the distribution of articles across disciplines and by year of publication. <back>
The system is available for exploration by anyone, but will only accept a request from a recognized user. If you have not registered and/or logged in before submitting the dataset query, the system will ask you to log in before proceeding with the request. <back>
While building a query you can iteratively refine the query criteria. However, once a query has been submitted for processing, it can no longer be edited. As of beta v2 you can retrieve an earlier query from the 'My Data Requests' page and resubmit and/or refine it. <back>
No. While some significant non-English content is present in the JSTOR archive, the search index is primarily constructed from ASCII-encoded text. Some specific metadata fields (such as titles and author names) do contain Unicode for non-English text, but fielded searching for this content is not supported by the current version of the DfR search engine. <back>
The result set consists of multiple files bundled as a zip archive file. The first thing that must be done to use the dataset is to unzip this bundle. Most operating systems provide tools for accomplishing this. The individual files contained in the dataset are structured as CSV (comma-separated values) files and are thus able to be directly imported by most spreadsheet programs. The CSV files also lend themselves to relatively easy parsing and processing by automated analysis tools. <back>
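The unzip-then-parse steps above can also be scripted. The sketch below builds a tiny in-memory stand-in for a bundle (the `wordcounts.csv` name and its columns are invented for illustration; consult the README in your dataset for the real file names) and then reads every CSV it contains with Python's standard `zipfile` and `csv` modules.

```python
import csv
import io
import zipfile

# Build a tiny stand-in for a DfR bundle; real bundles hold CSVs plus a README,
# and the file name and columns below are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("wordcounts.csv", "id,word,count\nj1,europe,4\nj1,press,2\n")

# Unzip the bundle and read each CSV file it contains.
tables = {}
with zipfile.ZipFile(buf) as bundle:
    for name in bundle.namelist():
        if name.endswith(".csv"):
            with bundle.open(name) as f:
                tables[name] = list(csv.reader(io.TextIOWrapper(f, encoding="utf-8")))

print(tables["wordcounts.csv"][0])  # → ['id', 'word', 'count']
```

For a real dataset, replace the in-memory buffer with the path to the downloaded zip file.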
The vast majority of the content in the JSTOR archive is a digital representation of a document that was originally produced as printed publication. An OCR (optical character recognition) process is used to generate text from the digitized print page images. While modern OCR programs are generally quite good and on average can often produce accuracy rates of 99% or better, the accuracy for a individual document can be affected by many factors, including the physical condition of the document. Since the search index and the database used for the word counts are derived from the OCR content, expect to find some nonsensical words in the generated datasets. More information on JSTOR's digitization process can be found on the JSTOR main site. <back>
A CSV (comma-separated values) file is simply a text file in which the specific values contained in each row are separated by a delimiter character, typically a comma, although other characters (such as a tab) can sometimes be used. All CSV files generated by the DfR site use the comma character as a field delimiter. More information on CSV files can be found on Wikipedia. <back>
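One practical note: because a field's own text can contain commas, CSV quotes such fields, so a naive `split(",")` will break them apart. A CSV-aware parser, such as Python's standard `csv` module, handles the quoting correctly (the row below is an invented example):

```python
import csv
import io

# A field containing a comma is quoted in CSV; csv.reader respects the quotes.
line = 'j1,"Europe, 1848",1849\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # → ['j1', 'Europe, 1848', '1849']
```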
The online DfR tool provides a convenient means for exploring the JSTOR archive and obtaining sample datasets. If sample sizes were unlimited we would quickly run into resource and load problems: the JSTOR archive contains nearly 5 million articles, and generating datasets at high volume is a resource-intensive process. The idea is that the sample provides a means for assessing the suitability of the data for the intended research project. If a larger dataset is desired, submit a request via the contact form specifying the dataset size needed along with a description of the project. <back>
DFR uses the nltk library to tokenize text before computing n-grams. If you are interested in the precise technique we currently use to generate n-grams, see this example file, which uses the same logic as DFR to generate tokens.
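To make the n-gram step concrete, here is a minimal sketch of forming bigrams from a token stream. DFR itself tokenizes with nltk; this sketch substitutes a simple regex tokenizer so it stands alone, and the `ngrams` helper is our own illustration, not DFR's code.

```python
import re

def ngrams(tokens, n):
    """Return the consecutive n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Stand-in tokenizer (DFR uses nltk instead): keep runs of word characters.
tokens = re.findall(r"\w+", "the quick brown fox")
print(ngrams(tokens, 2))
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```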
On December 18, 2014 we fixed a bug affecting n-gram generation (wordcounts, bigrams, trigrams, and quadgrams). When tokenizing the raw text of documents for n-gram generation, DFR was mishandling certain types of tokens. In some situations, tokens containing non-alpha characters (for example "quoted" words, or apostrophe'd words) were being treated as a punctuation mark instead of a word (such tokens would appear in n-grams as "###"). To fix this issue we've improved the way we split sentences into tokens before generating n-grams. Datasets generated before December 18, 2014 could be affected by this issue. Please contact us if you have any questions. If you'd like an updated dataset with the newer, more accurate tokenization code, simply re-submit your dataset request. <back>
JSTOR is part of ITHAKA, a not-for-profit organization helping the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
©2000-2010 ITHAKA. All Rights Reserved. JSTOR®, the JSTOR logo, and ITHAKA® are registered trademarks of ITHAKA.