---
title: "Getting Started with pdfsearch"
author: "Brandon LeBeau"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with pdfsearch}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This package defines a few useful functions for keyword searching using the [pdftools](https://github.com/ropensci/pdftools) package developed by [rOpenSci](https://ropensci.org/).

# Basic Usage
Two functions in this package are the main entry points for users. The first, `keyword_search`, takes a single pdf and searches it for keywords. The second, `keyword_directory`, runs the same search over a directory of pdfs.

# `keyword_search` Example
The package comes with two pdf files from [arXiv](https://arxiv.org/) to use as test cases. Below is an example of using the `keyword_search` function.
```{r search1, message = FALSE}
library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result <- keyword_search(file, 
            keyword = c('measurement', 'error'),
            path = TRUE)
head(result)
head(result$line_text, n = 2)
```

The location of the keyword match, including page number and line number, and the actual line of text are returned by default.
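
To make the shape of the output concrete, the sketch below builds a small stand-in data frame and filters it the way a real result might be filtered. The column names `keyword`, `page_num`, and `line_num` are assumptions about the returned object (only `line_text` appears in the example above), so check `names(result)` before relying on them. The rows are invented purely for illustration.

```{r result_structure_sketch}
# Stand-in for a keyword_search() result. Column names other than
# line_text are assumed, and the rows are made up for illustration.
toy_result <- data.frame(
  keyword = c("measurement", "error", "error"),
  page_num = c(1L, 1L, 3L),
  line_num = c(4L, 4L, 12L),
  line_text = c(
    "measurement error is common",
    "measurement error is common",
    "the standard error was small"
  ),
  stringsAsFactors = FALSE
)

# Number of matches per keyword
table(toy_result$keyword)

# Keep only the matches found on the first page
first_page <- toy_result[toy_result$page_num == 1, ]
first_page$line_text
```

The same subsetting pattern should carry over to a real result once the actual column names are confirmed.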

## Surrounding lines of text 
It may be useful to extract not just the line of text that the keyword is found in, but also surrounding text to have additional context when looking at the keyword results. This can be added by using the argument `surround_lines` as follows:
```{r surround}
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result <- keyword_search(file, 
            keyword = c('measurement', 'error'),
            path = TRUE, surround_lines = 1)
head(result)
head(result$line_text, n = 2)
```

## Combine hyphenated words
Typeset PDF files commonly contain words that wrap from one line to the next and are hyphenated. An example of this is shown in the following image.

![hyphenated example](hyphenated.png)

A word that is hyphenated across a line break is treated as two separate pieces, so a keyword search can miss a match that it would have found had the word not been split. The `remove_hyphen` argument of `keyword_search` addresses this by removing the hyphen at the end of a line and joining the word fragment with its continuation on the next line. Below is an example showing the results with and without the `remove_hyphen` argument. By default this argument is set to TRUE.

```{r remove_hyphen}
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

result_hyphen <- keyword_search(file, 
            keyword = c('measurement'),
            path = TRUE, remove_hyphen = FALSE)

result_remove_hyphen <- keyword_search(file, 
            keyword = c('measurement'),
            path = TRUE, remove_hyphen = TRUE)

nrow(result_hyphen)
nrow(result_remove_hyphen)
```

You'll notice that the removal of the hyphen added a few additional keyword matches to the results. These were cases where the word "measurement" wrapped across two lines and was hyphenated (see the image above that has an example of this). 
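
The idea behind joining hyphenated words can be sketched in a few lines of base R. This is only an illustration of the general technique on two made-up lines of text; it is not the package's actual implementation, and the object names are hypothetical.

```{r hyphen_sketch}
# Two lines of text where "measurement" wraps across the line break
wrapped_lines <- c("the impact of measure-", "ment error on estimates")

# Searching line by line finds no match
grepl("measurement", wrapped_lines)

# Dropping a hyphen that is followed by a line break restores the word
joined <- gsub("-\n", "", paste(wrapped_lines, collapse = "\n"), fixed = TRUE)
rejoined_lines <- strsplit(joined, "\n", fixed = TRUE)[[1]]
grepl("measurement", rejoined_lines)
```

A naive join like this one also merges compounds that are legitimately hyphenated at a line end (for example "mixed-" followed by "effects"), which is one reason the results are worth checking by eye.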

One caveat applies to PDF files with multiple columns: hyphen removal for these layouts is still experimental and often does not work well. Use the `remove_hyphen` argument with caution on multiple column PDF files.

## Split document into words

Using the [tokenizers](https://github.com/ropensci/tokenizers) R package, it is also possible to split the document into individual words. This may be most useful when the interest is in performing a text analysis rather than a keyword search. Below is an example showing the first page of the text converted to words. By default, hyphenated words at the end of the lines are removed (see previous section for description of this). 

```{r convert_tokens_text}
token_result <- convert_tokens(file, path = TRUE)[[1]]
head(token_result)
```
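
For a sense of what word tokens look like, the sketch below tokenizes a single made-up sentence with `tokenize_words()` from the tokenizers package. Whether `convert_tokens` calls this exact function with these default settings is an assumption; the example only illustrates word tokenization in general.

```{r tokenize_sketch}
# Split one line of text into word tokens; by default tokenize_words()
# lowercases the text and strips punctuation
example_line <- "Measurement error is common in survey data."
example_tokens <- tokenizers::tokenize_words(example_line)[[1]]
example_tokens
```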

Another use of the `convert_tokens` function is to convert the result text to tokens. This could be interesting when used in tandem with the `surround_lines` argument as input to a text analysis. These tokens are included by default when calling the `keyword_search` function.

```{r convert_tokens_result}
file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')

result <- keyword_search(file, 
            keyword = c('repeated measures', 'mixed effects'),
            path = TRUE, surround_lines = 1)
result
```


# `keyword_directory` Example
The `keyword_directory` function is useful when you have a directory of many pdf files and want to search all of them for a series of keywords in a single function call. This can be particularly useful in the context of a research synthesis or to screen studies for characteristics to include in a meta-analysis.

The two arXiv files that come with the package are stored in a single directory, which is used as the example here.

```{r directory}
directory <- system.file('pdf', package = 'pdfsearch')

result <- keyword_directory(directory,
                           keyword = c('repeated measures', 'mixed effects',
                                       'error'),
                           surround_lines = 1, full_names = TRUE)
head(result)
```

The `full_names` argument is needed here to specify that the full file path needs to be used to access the pdf files. If the search is done directly from the repository (i.e. when using an R project in RStudio), then `full_names` could be set to FALSE.
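
The difference between file names and full paths can be seen with base R's `list.files()`. The sketch assumes that `full_names` behaves like the `full.names` argument of `list.files()`, which is not confirmed here; the temporary directory and the empty placeholder files are hypothetical stand-ins for a folder of pdfs.

```{r full_names_sketch}
# Create a temporary directory holding two empty files named like pdfs
tmp_dir <- file.path(tempdir(), "pdf_example")
dir.create(tmp_dir, showWarnings = FALSE)
invisible(file.create(file.path(tmp_dir, c("a.pdf", "b.pdf"))))

# File names only: these resolve only relative to tmp_dir itself
list.files(tmp_dir, pattern = "\\.pdf$", full.names = FALSE)

# Full paths: these resolve from any working directory
full_paths <- list.files(tmp_dir, pattern = "\\.pdf$", full.names = TRUE)
basename(full_paths)
file.exists(full_paths)
```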

# Limitations
Currently there are a handful of limitations, mostly around how pdfs are read into R using the pdftools R package. When pdfs are created in a multiple column layout, a line in the pdf consists of the entire line across both columns. This can lead to fragmented text that may not give the full contents, even when using the `surround_lines` argument.

Another limitation arises when searching for keywords that consist of multiple words or phrases. If the whole phrase falls on a single line, the match is returned. However, if the phrase spans multiple lines, the current implementation will not return a match.
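
The multi-line limitation is easy to reproduce with plain character vectors. The sketch below uses two made-up lines and base R only; pasting each line to the one that follows is a possible workaround applied outside the package, not something `keyword_search` does.

```{r multiline_phrase_sketch}
# A phrase that is split across two lines of text
split_lines <- c("we fit a model with repeated", "measures on each participant")

# Line-by-line matching misses the phrase
grepl("repeated measures", split_lines, fixed = TRUE)

# Pasting each line to the one that follows it exposes the phrase
next_line <- c(split_lines[-1], "")
paired_lines <- trimws(paste(split_lines, next_line))
phrase_hits <- which(grepl("repeated measures", paired_lines, fixed = TRUE))
phrase_hits
```

With this pairing, a phrase that sits entirely on one line can be reported twice (once in its own pair and once in the pair that starts on the preceding line), so duplicates would need to be removed.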


# Shiny App
The package also has a simple Shiny app that can be launched with the following command:
```{r shiny, eval = FALSE}
run_shiny()
```