In this post, I explore the feasibility of conducting long text classification using a pre-trained image processing model. I will explain how to convert a very long text into a time series and subsequently transform it into an image. These images can then be analyzed using powerful image classifiers to identify the category of the text.
In what follows, I first discuss why converting text into a time series is interesting. Then I explain the rationale for converting text into an image and describe a method for doing so. Finally, I merge these components and conduct an experiment on classifying canonical and non-canonical texts.
Representing a text as a time series (more precisely, as a sequence of values of some text property) preserves its structural features and thus allows us to analyze how the text is organized. This is especially important for long texts such as literary works, where structure plays an important role in how the text is perceived by the reader. Despite the importance of structure, the common approach to text analysis has been bag-of-words (or bag-of-features); exceptions include traditional probabilistic language models and more modern LSTM neural models.
One simple way to convert text into a time series is by constructing the sequence of sentence lengths, as the sentence is one of the most fundamental linguistic levels. Previous studies have shown that sentence length, despite its straightforwardness, reveals interesting information about the structure of texts. Human-written texts exhibit fractality in their global structure, and fractal features can be used to classify text categories. (ref??, also look at …).
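The idea can be sketched in a few lines of Python. This is a minimal illustration with a crude regex splitter (a proper sentence tokenizer such as NLTK's punkt would do better on real fiction); the helper name `sentence_lengths` is mine, not from the post:

```python
import re

def sentence_lengths(text):
    """Return the length (in words) of each sentence in the text.

    Crudely splits on ., !, ? -- enough to show how a text becomes
    a time series of sentence lengths.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

series = sentence_lengths(
    "Call me Ishmael. Some years ago, I went to sea. Why? I was bored."
)
print(series)  # -> [3, 7, 1, 3]
```

The resulting list is the time series that will be turned into an image below.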
In the following experiment, I also use sentence length to represent long fictional texts. The sentence-length time series are converted to images using the Markov Transition Field, to which I turn in the next section.
The temporal nature of texts and the discrete structure of textual units have likely been two of the main obstacles in text processing. If we could visualize textual information as an image, we could grasp and analyze it more effectively. Fortunately, there are methods for converting time series into images, so if we represent a text as a time series, as explained in the previous section, we can use these methods to render the text as an image. One such method is the Markov Transition Field (MTF) technique, introduced by Wang and Oates (2015), which converts a time series into an image using transition probabilities between elements of the series.
Given a time series X = {x1, x2, …, xn}, the MTF image is constructed in four steps:
1- Assign each value xi to one of Q quantile bins q1, …, qQ.
2- Calculate the transition probability matrix W, where wij is the probability that a point in bin qi is immediately followed by a point in bin qj.
3- To preserve the temporal information alongside the probabilities, construct the Markov transition field matrix M, where the entry Mkl is the transition probability wij between the bin qi containing xk and the bin qj containing xl.
4- Visualize the MTF matrix as an image by mapping the probabilities to colors.
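The steps above can be sketched in plain NumPy. This is a minimal illustration of the procedure, not the post's actual code; the choice of 8 quantile bins and the uniform-quantile binning strategy are assumptions on my part:

```python
import numpy as np

def markov_transition_field(x, n_bins=8):
    """Compute the Markov Transition Field of a 1-D series.

    Step 1: assign each value to one of n_bins quantile bins.
    Step 2: estimate the bin-to-bin transition probability matrix W.
    Step 3: spread W over time: M[k, l] = W[bin(x_k), bin(x_l)].
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Step 1: quantile binning (each bin holds roughly the same number of points)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)  # bin index 0..n_bins-1 for each time step
    # Step 2: count transitions between consecutive time steps, then normalize rows
    W = np.zeros((n_bins, n_bins))
    for t in range(n - 1):
        W[bins[t], bins[t + 1]] += 1
    row_sums = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # Step 3: the n x n MTF matrix
    return W[np.ix_(bins, bins)]

mtf = markov_transition_field(np.cos(np.linspace(0, 4 * np.pi, 200)))
# Step 4: visualize, e.g. plt.imshow(mtf, cmap="rainbow")
```

Each pixel (k, l) of the resulting matrix holds the probability of moving from the bin of xk to the bin of xl, so plotting it as a heatmap gives the MTF image.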
Before presenting real examples, I will illustrate the MTF with a toy example. I used the pyts package for Python, following the sample code available on its website, to convert a time series of the cos function into an MTF image.
Let’s focus on the diagonal of the image first, which reveals local fluctuations in the time series. As depicted in the plot, the values of the cos function are compressed around its extrema, where the curve changes slowly; consecutive points there fall into the same bin, which produces the high self-transition probabilities visible along the diagonal.
The image also shows some high probabilities (big red squares) far from the diagonal. What do they represent? According to step 3 of the MTF procedure, the entry at position (k, l) holds the transition probability between the bins of xk and xl, however far apart k and l are in time. Because the cos function is periodic, values near its extrema recur at distant time steps, so the transition probability between the bins of such distant points is high, and these recurrences appear as large red squares away from the diagonal.
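This reading can be checked numerically. The snippet below is a hypothetical sketch, not part of the original experiment: for a cos series sampled so that one period spans exactly 100 steps, points one full period apart take the same value and therefore land in the same quantile bin:

```python
import numpy as np

# 201 samples over [0, 4*pi]: the step is 4*pi/200, so 100 steps = one period.
x = np.cos(np.linspace(0, 4 * np.pi, 201))
edges = np.quantile(x, np.linspace(0, 1, 9)[1:-1])  # 8 quantile bins
bins = np.digitize(x, edges)

# Fraction of points sharing a bin with the point one period later.
period = 100
same_bin = np.mean(bins[:-period] == bins[period:])
print(same_bin)  # close to 1: distant points map to the same bin
```

These shared bins are exactly what fills the off-diagonal regions of the MTF with high transition probabilities.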