Subjectivity Dataset v1.0

Available as 'subjectivity dataset v1.0' from http://www.cs.cornell.edu/People/pabo/people/pabo/movie-review-data

Source:
http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz

Attribution:
"subjectivity dataset v1.0 by Bo Pang
[https://sites.google.com/site/bopang42/] and
Lillian Lee [http://www.cs.cornell.edu/home/llee/] is licensed under CC BY 4.0
International [http://creativecommons.org/licenses/by/4.0/]"

NB: Line-endings have been converted from DOS to Unix. Some characters
have been converted from ISO-8859-1 to UTF-8. Occurences of NEL
Unicode character (\x85) were replaced by a single whitespace.

=======================================================

Introduction

This README v1.0 (June, 2004) for the v1.0 subjectivity dataset comes
from the URL
http://www.cs.cornell.edu/people/pabo/movie-review-data .

=======

Citation Info 

This data was first used in Bo Pang and Lillian Lee,
``A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization 
Based on Minimum Cuts'', Proceedings of the ACL, 2004.

@InProceedings{Pang+Lee:04a,
  author =       {Bo Pang and Lillian Lee},
  title =        {A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts},
  booktitle =    "Proceedings of the ACL",
  year =         2004
}

=======

Data Format Summary 

- rotten_imdb.tar.gz: contains this readme and two data files that
  were used in the experiments described in Pang/Lee ACL 2004.

  Specifically: 

  * quote.tok.gt9.5000 contains 5000 subjective sentences (or snippets);

  * plot.tok.gt9.5000 contains 5000 objective sentences.  

  Each line in these two files corresponds to a single sentence or
  snippet; all sentences (or snippets) are down-cased.  Only sentences
  or snippets containing at least 10 tokens were included.  The
  sentences and snippets were labeled automatically, as described
  below (see section "Label Decision").
  	 
- subjectivity_html.tar.gz: The original source files from which the
  extracted, processed, labeled, length-filtered, and (randomly)
  selected data in rotten_imdb.tar.gz was derived.

  Specifically:

  The starting points for data acquisition were snippets of movie
  reviews from Rotten Tomatoes (http://www.rottentomatoes.com/) and
  plot summaries for movies from the Internet Movie Database
  (http://www.imdb.com).  The html files under the folder "subj" are
  webpages containing Rotten Tomatoes snippets; the html files under
  the folder "obj" are pages containing IMDb plot summaries. The
  sentences/snippets in rotten_imdb.tar.gz represent a processed
  subset of these files.
   
=======

Label Decision 

We assumed that all snippets from the Rotten Tomatoes pages are
subjective, and all sentences from IMDb plot summaries are objective.
This is mostly true; but plot summaries can occasionally contain
subjective sentences that are mis-labeled as objective.
