Xlsxir is an Elixir library that parses .xlsx
files using Simple API for XML (SAX) parsing via the Erlsom Erlang library, extracts the data to an Erlang Term Storage (ETS) process and provides various functions for accessing the data. Xlsxir supports ISO 8601 date formats and large files.
Testing has been limited to various documents in Microsoft Excel and LibreOffice as well as any issues submitted through GitHub. Only English and Portuguese languages have been tested. A large worksheet containing 100 columns and 514K rows has successfully been parsed. Please submit any issues found and they will be addressed ASAP.
You can add Xlsxir as a dependancy to your Elixir project via the Hex package manager by adding the following to your mix.exs
file:
def deps do
[ {:xlsxir, "~> 1.4.1"} ]
end
Or, you can directly reference the GitHub repo:
def deps do
[ {:xlsxir, github: "kennellroxco/xlsxir"} ]
end
Xlsxir parses a .xlsx
file located at a given path
and extracts the data to an ETS process via the Xlsxir.extract/3
and Xlsxir.multi_extract/3
functions:
Xlsxir.extract(path, index, timer \\ false)
Xlsxir.multi_extract(path, index \\ nil, timer \\ false)
The multi_extract/3
function allows multiple worksheets to be parsed by creating a separate ETS process for each worksheet and returning a unique table identifier for each. This option will parse all worksheets by default
(when index == nil
), returning a list of tuple results.
Argument descriptions:
path
the path of the file to be parsed instring
formatindex
is the position of the worksheet you wish to parse (zero-based index)timer
is a boolean flag that controls an extraction timer that returns time elapsed when set totrue
. Defalut value isfalse
.
Upon successful completion, the extraction process returns:
- for
extract/3
::ok
withtimer
set tofalse
{:ok, time_elapsed}
withtimer
set totrue
- for
multi_extract/3
:[{:ok, table_1_id}, ...]
withtimer
set tofalse
{:ok, table_id}
when given a specific worksheetindex
[{:ok, table_1_id, time_elapsed}, ...]
withtimer
set totrue
{:ok, table_id, time_elapsed}
when given a specific worksheetindex
Unsucessful parsing of a specific worksheet returns {:error, reason}
.
The extracted worksheet data can be accessed using any of the following functions: ```elixir Xlsxir.get_list(table_id) Xlsxir.get_map(table_id) Xlsxir.get_mda(table_id) Xlsxir.get_cell(table_id, cell_ref) Xlsxir.get_row(table_id, row_num) Xlsxir.get_col(table_id, col_ltr) Xlsxir.get_info(table_id, num_type) ``` **Note:** `table_id` defaults to `:worksheet` and is therefore not required when using `Xlsxir.extract/3` to parse a given worksheet. The `table_id` parameter is only used with `Xlsxir.multi_extract/3`.
Xlsxir.get_list/1
returns entire worksheet in a list of row lists (i.e. [[row 1 values], ...]
)
Xlsxir.get_map/1
returns entire worksheet in a map of cell names and values (i.e. %{"A1" => value, ...}
)
Xlsxir.get_mda/1
returns entire worksheet in an indexed map which can be accessed like a multi-dimensional array (i.e. some_var[0][0]
for cell "A1")
Xlsxir.get_cell/2
returns value of specified cell (i.e. "A1"
returns value contained in cell A1)
Xlsxir.get_row/2
returns values of specified row (i.e. 1
returns the first row of data)
Xlsxir.get_col/2
returns values of specified column (i.e. "A"
returns the first column of data)
Xlsxir.get_info/1
and Xlsxir.get_multi_info/2
return count data for num_type
specified (i.e. :rows
, :cols
, :cells
, :all
)
Once the table data is no longer needed, run the following function to delete the ETS process and free memory:
Xlsxir.close(table_id)
When using Xlsxir.extract/3
, be sure to close an open ETS process before trying to parse another worksheet in the same session or process. If you try to open a new :worksheet
ETS process when one already exists, you will get an error. If the parsing of multiple worksheets is desired, use Xlsxir.multi_extract/3
instead.
Refer to Xlsxir documentation for more detailed examples.
Cell references are formatted as a string (i.e. "A1"). Strings will be returned as type string
, resulting values for functions from within the worksheet are returned as type string
, integer
or float
depending on the type of the resulting value, data formatted as a number in the worksheet will be returned as type integer
or float
, and ISO 8601 date formatted values will be returned in Erlang :calendar.date()
type format (i.e. {year, month, day}
). Xlsxir does not currently support dates prior to 1/1/1900.
- Additional performance improvement for larger files via GenStage.Flow
- Adding time support for dates (i.e. {{YYYY, MM, DD}, {h, m, s}})
- Export functionality to .xlsx file type with formatting options
- Implement Elixir 1.3 calendar datatypes support
Contributions are encouraged. Feel free to fork the repo, add your code along with appropriate tests and documentation (ensuring all existing tests continue to pass) and submit a pull request.
Please report any bugs or request future enhancements via the Issues page.
I'd like to thank the following people who were a big help in the development of this library:
- Paulo Almeida (@pma) was a big help with testing and has provided several great ideas for development.
- Benjamin Tan's (@benjamintanweihao) article on SAX parsing with Elrsom was invaluable.
- Daniel Berkompas' (@danielberkompas) article Multidimensional Arrays in Elixir inspired
Xlsxir.get_mda/0
.