Data frames are the essential part of the data science toolkit. They are the specialized data structures for tabular data sets that provide us with a simple and powerful API for summarizing, cleaning, and manipulating a wealth of data sources that are currently cumbersome to use. The DataFrame and DataSeries collections, stored in this repository, are specifically designed for working with structured data.
DataFrame library consists of two primary data structures:
DataFrame
is a spreadsheet-like tabular data structure that works like a relational database by providing the API for querying the data. Each row represents an observation, and every column is a feature. Both rows and columns of a data frame have names (keys) by which they can be accessed.DataSeries
is an array-like data structure used for working with specific rows or columns of a data frame. It has a name and contains an array of data mapped to a corresponding array of keys. DataSeries is a SequenceableCollection that combines the properties of an Array and a Dictionary, while extending the functionality of both by providing advanced messages for working with data, such as statistical summaries, visualizations etc.
The following script installs DataFrame and its dependencies into a Pharo image. Along with all the other code blocks in this tutorial, this script has been tested on Pharo-6.0 and Pharo64-6.0 for both Linux and OSX, and Pharo-6.0 for Windows.
Metacello new
baseline: 'DataFrame';
repository: 'github://PolyMathOrg/DataFrame';
load.
DataSeries can be created from an array of values
series := DataSeries fromArray: #(a b c).
By extending the Collection class DataFrame library provides us with a handy shortcut for converting any collection (e.g. an Array) to DataSeries
series := #(a b c) asDataSeries.
By default the keys will be initialized with an interval (1 to: self size)
. The name of a newly created series is considered empty and set by default to nil
. You can always change the name and keys of your series using these messages
series name: 'letters'.
series keys: #(k1 k2 k3).
When accessing the elements of a DataSeries, you can think of is as an Array. at:
message allows you to access elements by their index, with at:put:
you can modify the given element.
series at: 2. "b"
series at: 3 put: 'x'.
Besides the standard Array accessors, DataSeries provides additional operations for accessing elements by their keys
series atKey: #k2. "b"
series atKey: #k3 put: 'x'.
Messages for enumerating, such as do:
or withIndexDo:
work the same as in Array, and the collect:
message creates a new DataSerie preserving the name and keys of the receiver.
newSeries := series collect: [ :each | each, 'x' ].
newSeries name. "letters"
newSeries atKey: 'k1'. "ax"
Since all the elements of DataSeries are expected to have keys, in order to add a new element you must specify a key that should be associated with it
series add: 'x' atKey: #k4.
In future there will also be an add:
message that will choose a default key value or even predict the next element of a sequence of keys. For example, if keys are #(1 2 3 4 5)
, the new element will be associated with key 6
, if the sequence of keys is #(2 4 6)
, the next key will be 8
. Perhaps, it will even be possible to predict the next key in symbolic sequences, such as #(a b c d)
or #(k1 k2 k3)
. Such functionality is not easy to implement, so we leave it for future releases of DataFrame library.
Another way of adding an element to DataSeries is by using the atKey:put:
message with a non-existing key. Inspired by Pandas, atKey:put:
modifies an element at a given key if such key exists in a DataSeries, or adds a new element associeted with that key, if it wasn't found.
series atKey: #k4 put: 'x'.
Keep in mind that both add:atKey:
and atKey:put:
messages don't create a new series, but modify the existing one. So use them with caution.
There are four ways of creating a data frame:
The easiest and most straightforward way of creating a DataFrame is by passing all data in an array of arrays to fromRows:
or fromColumns:
message. Here is an example of initializing a DataFrame with rows:
df := DataFrame fromRows: #(
('Barcelona' 1.609 true)
('Dubai' 2.789 true)
('London' 8.788 false)).
The same data frame can be created from the array of columns
df := DataFrame fromColumns: #(
('Barcelona' 'Dubai' 'London')
(1.609 2.789 8.788)
(true true false)).
Since the names of rows and columns are not provided, they are initialized with their default values: (1 to: self numberOfRows)
and (1 to: self numberOfColumns)
. Both rowNames
and columnNames
can always be changed by passing an array of new names to a corresponding accessor. This array must be of the same size as the number of rows and columns.
df columnNames: #(City Population BeenThere).
df rowNames: #(A B C).
If you print (Ctrl+P) this data frame, you will this pretty-printed table that can be coppied and pasted into letters, blog posts, and tutorials (such as this one)
| City Population BeenThere
---+----------------------------------
A | Barcelona 1.609 true
B | Dubai 2.789 true
C | London 8.788 false
By it's nature DataFrame is similar to a matrix. It works like a table of values, supports matrix accessors, such as at:at:
or at:at:put:
and in some cases can be treated like a matrix. Some classes provide tabular data in matrix format. For example TabularWorksheet class of Tabular package that is used for reading XLSX files. To initialize a DataFrame from a maxtrix of values, use fromMatrix:
method
matrix := Matrix
rows: 3 columns: 3
contents:
#('Barcelona' 1.609 true
'Dubai' 2.789 true
'London' 8.788 false).
df := DataFrame fromMatrix: matrix.
Once again, the names of rows and columns are set to their default values.
In most real-world scenarios the data is located in a file or database. The support for database connections will be added in future releases. Right now DataFrame provides you the methods for loading data from two most commot file formats: CSV and XLSX
DataFrame fromCSV: 'path/to/your/file.csv'.
DataFrame fromXLSX: 'path/to/your/file.xlsx'.
Since JSON does not store data as a table, it is not possible to read such file directly into a DataFrame. However, you can parse JSON using NeoJSON or any other library, construct an array of rows and pass it to fromRows:
message, as described in previous sections.
DataFrame provides several famous datasets for you to play with. They are compact and can be loaded with a simple message. An this point there are three datasets that can be loaded in this way - Iris flower dataset, a simplified Boston Housing dataset, and Restaurant tipping dataset.
DataFrame loadIris.
DataFrame loadHousing.
DataFrame loadTips.
Rows and columns of a data frame can be accessed either by their names or their numeric indexes. You can access row 'C' and the column 'Population' of a data frame created in the previous sections by writing
df row: 'C'.
df column: 'Population'.
Alternatively, you can use numeric indexes. Here is how you can ask a data frame for a third row or a second column:
df rowAt: 3.
df columnAt: 2.
The important feature of a DataFrame
is that when asked for a specific row or column, it responds with a DataSeries
object that preserves the same indexing. This way, if you extract row 'B' from a data frame, it will still remember that 'Dubai' is a city with a population of 2.789 million people
| B
------------+-------
City | Dubai
Population | 2.789
BeenThere | true
You can access multiple columns at a same time by providing an array of column names or indexes, or by specifying the numeric range. For this purpose DataFrame provides messages rows:
, columns:
, rowsAt:
, columnsAt:
, rowsFrom:to:
, and columnsFrom:to:
df columns: #(City BeenThere).
df rowsAt: #(3 1).
df columnsFrom: 2 to: 3.
df rowsFrom: 3 to: 1.
The result will be a data frame with requested rows and columns in a given order. For example, the last line will give you a data frame "flipped upside-down" (with row indexes going in the descending order).
You can change the values of a specific row or column by passing an array or series of the same size to one of the messages: row:put:
, column:put:
, rowAt:put:
, columnAt:put:
. Be careful though, because these messages modify the data frame and may result in the loss of data.
df column: #BeenThere put: #(false true false).
As it was mentioned above, single cell of a data frame can be accessed with at:at:
and at:at:put:
messages
df at: 3 at: 2.
df at: 3 at: 2 put: true.
When working with bigger datasets it's often useful to access only the first or the last 5 rows. This can be done using head
and tail
messages. To see how they work let's load the Housing dataset.
df := DataFrame loadHousing.
This dataset has 489 entries. Printing all these rows in order to understand how this data looks like is unnecessary. On larger datasets it can also be time consuming. To take a quick look on your data, use df head
or df tail
| RM LSTAT PTRATIO MDEV
---+---------------------------------
1 | 6.575 4.98 15.3 504000.0
2 | 6.421 9.14 17.8 453600.0
3 | 7.185 4.03 17.8 728700.0
4 | 6.998 2.94 18.7 701400.0
5 | 7.147 5.33 18.7 760200.0
The resuld will be another data frame. head
and tail
messages are just shortcuts for df rowsFrom: 1 to: 5
and df rowsFrom: (df numberOfRows - 5) to: df numberOfRows.
. But what if you want a different number of rows? You can do that using parametrized messages head:
and tail:
with a given number of rows.
df head: 10.
df tail: 3.
You can also look at the head or tail of a specific column, since all these messages are also supported by DataSeries
(df column: #LSTAT) head: 2.
The result will be another series
| LSTAT
---+-------
1 | 4.98
2 | 9.14
New rows and columns can be appended to the data frame using messages addRow:named
and addColumn:named
. Like in the case of DataSeries, you must provide a name for these new elements, since it can not continue the existing sequence of names.
df addRow: #('Lviv' 0.724 true) named: #D.
df addColumn: #(4 3 4) named: #Rating.
The same can be done using messages row:put:
and column:put:
with non-existing keys. DataFrame will append the new key and associate it with a given row or column
df at: #D put: #('Lviv' 0.724 true).
df at: #Rating put: #(4 3 4).
SELECT is the most commonly used SQL statement that allows you to subset your data by applying filters to it using WHERE clause. The query language of DataFrame is designed to resemble SQL, so if you have some experience with relational databases, you should "feel like home".
The examples in this section will be using Iris dataset
df := DataFrame loadIris.
There are two things you need to specify in order to subset your data with select:where:
message:
- What features (columns) do you want to get
- What conditions should the observations (rows) satisfy in order to be selected
First argument of the select:where:
message should be an array of column names. They will not affect the selection of rows, but the resulting data frame will contain only these columns. Second argument should be a block with boolean conditions that will be applied to each row of data frame. Only those rows that make a block return true
will be selected. In your conditions you will be referencing the features of your observations. For example, in Iris dataset you might want to select those flowers that belong to #setosa
species and have the width of sepal equal to 3
. To make queries more readable, DataFrame provides a querying language that allows you to specify the columns which you are using in your conditions as arguments of the where-block, and use these arguments in your conditions. So, for example, a block [ :species | species = #setosa ]
passed to select:where:
message will be translated to [ :row | (row atKey: #species) = #setosa ]
and applied to every row of data frame. This means that all the arguments of the block you pass must correspond to the column names of your data frame.
Here is a query that selects petal_width
and petal_length
columns, and all the rows that satisfy the condition described in the previous paragraph
df select: #(petal_width petal_length)
where: [ :species :sepal_width |
species = #setosa and: sepal_width = 3 ].
If you rather want to select all the columns of a data frame, use the selectAllWhere:
message. It works in a same way as SELECT * WHERE
in SQL
df selectAllWhere: [ :species :sepal_width |
species = #setosa and: sepal_width = 3 ].
This query will return a data frame will all 5 columns of Iris dataset and 6 rows that satisfy the given condition.
| sepal_length sepal_width petal_length petal_width species
----+---------------------------------------------------------------
2 | 4.9 3 1.4 0.2 setosa
13 | 4.8 3 1.4 0.1 setosa
14 | 4.3 3 1.1 0.1 setosa
26 | 5 3 1.6 0.2 setosa
39 | 4.4 3 1.3 0.2 setosa
46 | 4.8 3 1.4 0.3 setosa
The previous query will return you only the petal_width
and petal_length
columns of this data frame. Try it yourself!
df select: #(sepal_length species)
where: [ :petal_length :petal_width |
(petal_length < 4.9 and: petal_length > 1.6) and:
(petal_width < 0.4 or: petal_width > 1.5) ]
groupBy: #species
aggregate: #sum.
The result of this query will be a data frame with a single column
| sepal_length
------------+--------------
setosa | 15.9
versicolor | 18.2
virginica | 17.1