Text classification with penalized logistic regression

Machine Learning
R
Train a classification model with tidymodels to distinguish Charlotte Brontë from Emily Brontë
Published May 2, 2020

This article demonstrates a modeling example using the tidymodels framework for text classification. The data are downloaded via the gutenbergr package and include 5 books written by either Emily Brontë or Charlotte Brontë. The goal is to predict the author from the words in a line, that is, the probability that a line was written by one sister rather than the other.

The cleaned books dataset contains lines as individual rows.

library(tidyverse)   # dplyr, tidyr, stringr, ggplot2
library(tidytext)    # unnest_tokens(), stop_words, reorder_within()
library(gutenbergr)  # download texts from Project Gutenberg

mirror_url <- "http://mirrors.xmission.com/gutenberg/"

books <- gutenberg_works(author %in% c("Brontë, Emily", "Brontë, Charlotte")) %>%
  gutenberg_download(meta_fields = c("title", "author"), mirror = mirror_url) %>%
  transmute(title,
            author = if_else(author == "Brontë, Emily",
                             "Emily Brontë",
                             "Charlotte Brontë") %>% factor(),
            line_index = row_number(),
            text)
books
#> # A tibble: 89,588 × 4
#>    title             author       line_index text               
#>    <chr>             <fct>             <int> <chr>              
#>  1 Wuthering Heights Emily Brontë          1 "Wuthering Heights"
#>  2 Wuthering Heights Emily Brontë          2 ""                 
#>  3 Wuthering Heights Emily Brontë          3 "by Emily Brontë"  
#>  4 Wuthering Heights Emily Brontë          4 ""                 
#>  5 Wuthering Heights Emily Brontë          5 ""                 
#>  6 Wuthering Heights Emily Brontë          6 ""                 
#>  7 Wuthering Heights Emily Brontë          7 ""                 
#>  8 Wuthering Heights Emily Brontë          8 "CHAPTER I"        
#>  9 Wuthering Heights Emily Brontë          9 ""                 
#> 10 Wuthering Heights Emily Brontë         10 ""                 
#> # … with 89,578 more rows
#> # ℹ Use `print(n = ...)` to see more rows
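As a quick check, distinct() lists the downloaded titles and their authors:

books %>%
  distinct(title, author)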

To obtain the tidy text structure illustrated in Text Mining with R, I use unnest_tokens() to perform tokenization and remove all the stop words. I also remove characters like _, 's, ' and whitespace to get valid column names after widening. It turns out this serves as some sort of stemming too (heathcliff’s becomes heathcliff)! Then low-frequency words (those whose count is less than 0.05% of an author’s total word count) are removed. The cutoff may be a little too high if you plot the histogram, but I really need it to save computation effort on my laptop :sweat_smile:.

clean_books <- books %>%
  unnest_tokens(word, text) %>%                      # tokenize: one word per row
  anti_join(stop_words, by = "word") %>%             # remove stop words
  filter(!str_detect(word, "^\\d+$")) %>%            # drop pure numbers
  mutate(word = str_remove_all(word, "_|'s|'|\\s"))  # strip _, 's, ' and whitespace

total_words <- clean_books %>%
  count(author, name = "total")

tidy_books <- clean_books %>%
  left_join(total_words, by = "author") %>%
  group_by(author, total, word) %>%
  filter((n() / total) > 0.0005) %>%  # keep words above 0.05% of the author's total
  ungroup()
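As a sanity check (not part of the original analysis), we can see how much vocabulary survives the 0.05% cutoff for each author:

tidy_books %>%
  distinct(author, word) %>%
  count(author, name = "vocabulary_size")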

Comparing word frequency

Before building an actual predictive model, let’s do some EDA to see how differently the two sisters tend to use particular words! This will also shed light on what we should expect from the text classification. We will compare word frequencies (proportions) between the two sisters.

tidy_books %>%
  group_by(author, total) %>%
  count(word) %>%
  mutate(prop = n / total) %>%
  ungroup() %>%
  select(-total, -n) %>%
  pivot_wider(names_from  = author, values_from = prop,
              values_fill = list(prop = 0)) %>%
  ggplot(aes(x = `Charlotte Brontë`, y = `Emily Brontë`,
             color = abs(`Emily Brontë` -  `Charlotte Brontë`))) +
  geom_jitter(width = 0.001, height = 0.001, alpha = 0.2, size = 2.5) +
  geom_abline(color = "gray40", lty = 2) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5, size = 7.5) +
  scale_color_gradient(low = "darkslategray4", high = "gray75") +
  scale_x_continuous(labels = scales::label_percent()) +
  scale_y_continuous(labels = scales::label_percent()) +
  theme(legend.position = "none") +
  coord_cartesian(xlim = c(0, NA)) +
  theme(text = element_text(size = 18),
        plot.title.position = "plot")

Figure 1: Comparing word frequency of the two sisters

Words lying near the line, such as “home”, “head” and “half”, indicate a similar tendency to use that word, while words far from the line are found much more in one set of texts than the other, for example “heathcliff”, “linton” and “catherine”.

What does this plot tell us? Judged by word frequency alone, there appear to be quite a few words that are characteristic of Emily Brontë (upper left corner). Charlotte, on the other hand, has few representative words (bottom right corner). We will investigate this further with the model.
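One way to make “characteristic” concrete is a smoothed frequency ratio, in the spirit of the word-ratio analysis in Text Mining with R; the +1 smoothing here is my own choice:

tidy_books %>%
  count(author, word) %>%
  pivot_wider(names_from = author, values_from = n,
              values_fill = list(n = 0)) %>%
  mutate(log_ratio = log2(((`Emily Brontë` + 1) / sum(`Emily Brontë` + 1)) /
                          ((`Charlotte Brontë` + 1) / sum(`Charlotte Brontë` + 1)))) %>%
  arrange(desc(abs(log_ratio)))

Words with large positive log_ratio are distinctively Emily’s; large negative ones are distinctively Charlotte’s.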

Modeling

Data preprocessing

There are 428 features (words) and 47,266 observations in total. Approximately 19% of the data come from Emily Brontë.

tidy_books %>%
  count(author) %>%
  mutate(prop = n / sum(n))
#> # A tibble: 2 × 3
#>   author               n  prop
#>   <fct>            <int> <dbl>
#> 1 Charlotte Brontë 64349 0.808
#> 2 Emily Brontë     15310 0.192

Now it’s time to widen our data into an appropriate model structure, similar to a document-term matrix, with each row being a line and each column holding a word count.

library(tidymodels)
set.seed(2020)
doParallel::registerDoParallel()

model_df <- tidy_books %>%
  count(line_index, word) %>%
  pivot_wider(names_from = word, values_from = n,
              values_fill = list(n = 0)) %>%  # lines as rows, word counts as columns
  left_join(books, by = "line_index") %>%     # bring back the author label
  select(-title, -text)

model_df
#> # A tibble: 47,266 × 430
#>    line_index heights wuthering chapter returned visit heath…¹ black  eyes heart
#>         <int>   <int>     <int>   <int>    <int> <int>   <int> <int> <int> <int>
#>  1          1       1         1       0        0     0       0     0     0     0
#>  2          8       0         0       1        0     0       0     0     0     0
#>  3         11       0         0       0        1     1       0     0     0     0
#>  4         15       0         0       0        0     0       1     0     0     0
#>  5         17       0         0       0        0     0       0     1     1     1
#>  6         19       0         0       0        0     0       0     0     0     0
#>  7         22       0         0       0        0     0       1     0     0     0
#>  8         24       0         0       0        0     0       0     0     0     0
#>  9         26       0         0       0        0     0       0     0     0     0
#> 10         27       0         0       0        0     0       0     0     0     0
#> # … with 47,256 more rows, 420 more variables: fingers <int>, answer <int>,
#> #   sir <int>, hope <int>, grange <int>, heard <int>, thrushcross <int>,
#> #   interrupted <int>, walk <int>, closed <int>, uttered <int>, gate <int>,
#> #   words <int>, hand <int>, entered <int>, joseph <int>, bring <int>,
#> #   horse <int>, suppose <int>, nay <int>, dinner <int>, `heathcliff’s` <int>,
#> #   times <int>, guess <int>, wind <int>, house <int>, set <int>, strong <int>,
#> #   wall <int>, door <int>, earnshaw <int>, hareton <int>, short <int>, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
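This line-by-word table is extremely sparse, which is part of why a penalized model is a sensible choice here. A quick check (my own addition) of the proportion of zero entries:

word_counts <- model_df %>%
  select(-line_index, -author) %>%
  as.matrix()
mean(word_counts == 0)  # proportion of zero cells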

Train a penalized logistic regression model

Split the data into a training set and a testing set; initial_split() holds out 25% of the rows by default.

book_split <- initial_split(model_df)
book_train <- training(book_split)
book_test <- testing(book_split)
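The split is random rather than stratified, so it is worth confirming (a check of my own) that the class imbalance carried over into the training set; initial_split() also accepts a strata argument if you want to enforce this:

book_train %>%
  count(author) %>%
  mutate(prop = n / sum(n))

# alternatively, a stratified split:
# book_split <- initial_split(model_df, strata = author)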

Specify an L1-penalized logistic regression model, center and scale all predictors, and combine them into a workflow object.

logistic_spec <- logistic_reg(penalty = 0.05, mixture = 1) %>%
  set_engine("glmnet")

book_rec <- recipe(author ~ ., data = book_train) %>%
  update_role(line_index, new_role = "ID") %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

book_wf <- workflow() %>%
  add_model(logistic_spec) %>%
  add_recipe(book_rec)

initial_fit <- book_wf %>%
  fit(data = book_train)

initial_fit is a simple fitted model with no tuned hyperparameters. By default glmnet fits a whole path of (up to) 100 lambda values even though I specified \(\lambda = 0.05\), so the extracted results aren’t that helpful on their own.

initial_fit %>%
  extract_fit_parsnip() %>%
  tidy()
#> # A tibble: 429 × 3
#>    term        estimate penalty
#>    <chr>          <dbl>   <dbl>
#>  1 (Intercept)   -1.63     0.05
#>  2 heights        0        0.05
#>  3 wuthering      0        0.05
#>  4 chapter        0        0.05
#>  5 returned       0        0.05
#>  6 visit          0        0.05
#>  7 heathcliff     0.134    0.05
#>  8 black          0        0.05
#>  9 eyes           0        0.05
#> 10 heart          0        0.05
#> # … with 419 more rows
#> # ℹ Use `print(n = ...)` to see more rows
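To make the sparsity at \(\lambda = 0.05\) concrete, we can count how many coefficients survive (a quick check, not in the original post):

initial_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(term != "(Intercept)", estimate != 0)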

We can make predictions with initial_fit anyway, and examine metrics like overall accuracy.

initial_predict <- predict(initial_fit, book_test) %>%
  bind_cols(predict(initial_fit, book_test, type = "prob")) %>%
  bind_cols(book_test %>% select(author, line_index))

initial_predict
#> # A tibble: 11,817 × 5
#>    .pred_class      `.pred_Charlotte Brontë` `.pred_Emily Brontë` author line_…¹
#>    <fct>                               <dbl>                <dbl> <fct>    <int>
#>  1 Charlotte Brontë                    0.840                0.160 Emily…      11
#>  2 Charlotte Brontë                    0.557                0.443 Emily…      15
#>  3 Charlotte Brontë                    0.840                0.160 Emily…      19
#>  4 Charlotte Brontë                    0.840                0.160 Emily…      31
#>  5 Charlotte Brontë                    0.840                0.160 Emily…      43
#>  6 Charlotte Brontë                    0.840                0.160 Emily…      59
#>  7 Charlotte Brontë                    0.840                0.160 Emily…      60
#>  8 Charlotte Brontë                    0.840                0.160 Emily…      69
#>  9 Charlotte Brontë                    0.840                0.160 Emily…      88
#> 10 Charlotte Brontë                    0.840                0.160 Emily…      89
#> # … with 11,807 more rows, and abbreviated variable name ¹​line_index
#> # ℹ Use `print(n = ...)` to see more rows

How good is our initial model?

initial_predict %>%
  accuracy(truth = author, estimate = .pred_class)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.836

Nearly 84% of all predictions are right. This isn’t a very satisfactory result, since Charlotte Brontë accounts for 81% of the data, making our model only slightly better than a classifier that assigns Charlotte Brontë to every line.
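With classes this imbalanced, a probability-based metric is a useful complement to accuracy. For instance, since Charlotte Brontë is the first factor level (and thus the default event in yardstick), roc_auc can be computed from her predicted probabilities:

initial_predict %>%
  roc_auc(truth = author, `.pred_Charlotte Brontë`)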

Tuning lambda

We can figure out an appropriate penalty by tuning the model over a grid of candidate values with resampling.

logistic_wf_tune <- book_wf %>%
  update_model(logistic_spec %>% set_args(penalty = tune()))

lambda_grid <- grid_regular(penalty(), levels = 100)
book_folds <- vfold_cv(book_train, v = 10)

Here I build a set of 10 cross-validation resamples, and set levels = 100 to try 100 values of lambda, spaced on a log scale over penalty()’s default range of \(10^{-10}\) to 1.
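To see what those candidates actually look like (a quick peek of my own):

range(lambda_grid$penalty)
nrow(lambda_grid)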

Then I tune the grid:

logistic_results <- logistic_wf_tune %>%
  tune_grid(resamples = book_folds, grid = lambda_grid)

There is an autoplot() method for the tuned results, but I instead plot the two metrics against lambda myself.

logistic_results %>%
  collect_metrics() %>%
  mutate(lower_bound = mean - std_err,
         upper_bound = mean + std_err) %>%
  ggplot(aes(penalty, mean)) +
  geom_line(aes(color = .metric), size = 1.5, show.legend = FALSE) +
  geom_errorbar(aes(ymin = lower_bound, ymax = upper_bound)) +
  facet_wrap(~ .metric, nrow = 2, scales = "free") +
  labs(y = NULL, x = expression(lambda))

Classification metrics across strength of L1 regularization

OK, both metrics decrease monotonically as lambda increases, but they stop changing much once lambda is greater than about 0.1, at which point the model is essentially guessing according to each author’s proportion in the data. This plot shows that the model is generally better at small penalties, suggesting that the majority of the predictors are fairly important to the model. Still, we may lean towards a larger penalty with slightly worse performance, because it leads to a simpler model. It follows that we may want to choose a lambda from the top rows of the following data frame.

top_models <- logistic_results %>%
  show_best("roc_auc", n = 100) %>%
  arrange(desc(penalty)) %>%
  filter(mean > 0.9)

top_models
#> # A tibble: 77 × 7
#>     penalty .metric .estimator  mean     n std_err .config               
#>       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
#>  1 0.00475  roc_auc binary     0.908    10 0.00188 Preprocessor1_Model077
#>  2 0.00376  roc_auc binary     0.913    10 0.00194 Preprocessor1_Model076
#>  3 0.00298  roc_auc binary     0.917    10 0.00210 Preprocessor1_Model075
#>  4 0.00236  roc_auc binary     0.920    10 0.00183 Preprocessor1_Model074
#>  5 0.00187  roc_auc binary     0.921    10 0.00202 Preprocessor1_Model073
#>  6 0.00148  roc_auc binary     0.921    10 0.00221 Preprocessor1_Model072
#>  7 0.00118  roc_auc binary     0.921    10 0.00222 Preprocessor1_Model071
#>  8 0.000933 roc_auc binary     0.921    10 0.00210 Preprocessor1_Model070
#>  9 0.000739 roc_auc binary     0.921    10 0.00211 Preprocessor1_Model069
#> 10 0.000586 roc_auc binary     0.921    10 0.00206 Preprocessor1_Model068
#> # … with 67 more rows
#> # ℹ Use `print(n = ...)` to see more rows

select_best() returns \(\lambda \approx 0.000464\) (the 11th row of top_models) for its highest mean roc_auc. But I’ll stick to the parsimony principle and pick \(\lambda \approx 0.00475\), the first row of top_models, at the cost of a drop in mean roc_auc from about 0.921 to 0.908.

logistic_results %>%
  select_best(metric = "roc_auc")
#> # A tibble: 1 × 2
#>    penalty .config               
#>      <dbl> <chr>                 
#> 1 0.000464 Preprocessor1_Model067
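As an aside, tune ships helpers that encode exactly this parsimony preference. A sketch using the one-standard-error rule, which picks the largest penalty whose mean roc_auc is within one standard error of the best:

logistic_results %>%
  select_by_one_std_err(desc(penalty), metric = "roc_auc")

Here, though, I finalize the workflow with the first row of top_models picked above.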

book_wf_final <- finalize_workflow(logistic_wf_tune,
                                  parameters = top_models %>% slice(1))

Now the model specification in the workflow is filled in with the chosen lambda:

book_wf_final %>% extract_spec_parsnip()
#> Logistic Regression Model Specification (classification)
#> 
#> Main Arguments:
#>   penalty = 0.00475081016210279
#>   mixture = 1
#> 
#> Computational engine: glmnet

The next step is to fit the finalized model on the training set and evaluate it against the test set, which last_fit() does in one go.

logistic_final <- last_fit(book_wf_final, split = book_split)

logistic_final %>%
  collect_metrics()
#> # A tibble: 2 × 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>               
#> 1 accuracy binary         0.938 Preprocessor1_Model1
#> 2 roc_auc  binary         0.910 Preprocessor1_Model1
logistic_final %>%
  collect_predictions() %>%
  roc_curve(truth = author, `.pred_Charlotte Brontë`) %>%
  autoplot()

The accuracy of our logistic model rises by roughly 10 percentage points to 93.8%, with roc_auc at about 0.91. This is pretty good!

There is also the confusion matrix to check. The model does well at identifying Charlotte Brontë (only 1 of her lines in the test set is misclassified), yet it suffers a relatively high false negative rate for Emily Brontë, mistakenly labelling about 37% of her lines as Charlotte’s (i.e. low specificity, with Charlotte as the positive class). In part, this is due to class imbalance (four out of five books were written by Charlotte).

logistic_final %>%
  collect_predictions() %>%
  conf_mat(truth = author, estimate = .pred_class)
#>                   Truth
#> Prediction         Charlotte Brontë Emily Brontë
#>   Charlotte Brontë             9872          727
#>   Emily Brontë                    1         1217
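Calling summary() on the conf_mat object makes the sensitivity/specificity reading above explicit (Charlotte Brontë, the first factor level, is treated as the event):

logistic_final %>%
  collect_predictions() %>%
  conf_mat(truth = author, estimate = .pred_class) %>%
  summary()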

To examine the effect of the predictors, I again fit the finalized workflow and use extract_fit_parsnip() to pull out the underlying model fit. Variable importance plots, implemented in the vip package, provide an intuitive way to visualize predictor importance; for a glmnet model, vi() uses the absolute value of the coefficients at the given lambda as the importance measure.

library(vip)

logistic_vi <- book_wf_final %>%
  fit(book_train) %>%
  extract_fit_parsnip() %>%
  vi(lambda = top_models[1, ]$penalty) %>%
  group_by(Sign) %>%
  slice_max(order_by = abs(Importance), n = 30) %>%
  ungroup() %>%
  mutate(Sign = if_else(Sign == "POS",
                        "More Emily Brontë",
                        "More Charlotte Brontë"))

logistic_vi %>%
  ggplot(aes(y = reorder_within(Variable, abs(Importance), Sign),
             x = Importance)) +
  geom_col(aes(fill = Sign),
           show.legend = FALSE, alpha = 0.6) +
  scale_y_reordered() +
  facet_wrap(~ Sign, nrow = 1, scales = "free") +
  labs(title = "How word usage classifies Brontë sisters",
       x = NULL,
       y = NULL) +
  theme(axis.text = element_text(size = 18),
        plot.title = element_text(size = 24),
        plot.title.position = "plot")

Variable importance plot for penalized logistic regression

Is it cheating to use character names to classify authors? Perhaps I should consider including more books and removing names for text classification next time.
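A minimal sketch of that idea, with a hand-picked (and certainly incomplete) list of character names standing in for a proper stop list:

# hypothetical name list, for illustration only
character_names <- c("heathcliff", "catherine", "linton",
                     "hareton", "earnshaw", "joseph")

tidy_books_no_names <- tidy_books %>%
  filter(!word %in% character_names)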

Note that the variable importance scores in the left panel are generally smaller than those in the right; this corresponds to what we found in the word frequency plot (Figure 1), namely that Emily Brontë has more, and stronger, characteristic words.