Install Ethereum ETL:
pip3 install ethereum-etl
Export blocks and transactions (Schema, Reference):
> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \
--provider-uri https://mainnet.infura.io --blocks-output blocks.csv --transactions-output transactions.csv
Export ERC20 and ERC721 transfers (Schema, Reference):
> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv
Export receipts and logs (Schema, Reference):
> ethereumetl export_receipts_and_logs --transaction-hashes transaction_hashes.txt \
--provider-uri https://mainnet.infura.io --receipts-output receipts.csv --logs-output logs.csv
Export ERC20 and ERC721 token details (Schema, Reference):
> ethereumetl export_tokens --token-addresses token_addresses.csv \
--provider-uri https://mainnet.infura.io --output tokens.csv
Export traces (Schema, Reference):
> ethereumetl export_traces --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/parity.ipc --output traces.csv
For the latest version, check out the repo and call
> pip3 install -e .
> python3 ethereumetl.py
- Schema
- Exporting the Blockchain
- Ethereum Classic Support
- Querying in Amazon Athena
- Querying in Google BigQuery
Column | Type |
---|---|
number | bigint |
hash | hex_string |
parent_hash | hex_string |
nonce | hex_string |
sha3_uncles | hex_string |
logs_bloom | hex_string |
transactions_root | hex_string |
state_root | hex_string |
receipts_root | hex_string |
miner | address |
difficulty | numeric |
total_difficulty | numeric |
size | bigint |
extra_data | hex_string |
gas_limit | bigint |
gas_used | bigint |
timestamp | bigint |
transaction_count | bigint |
Column | Type |
---|---|
hash | hex_string |
nonce | bigint |
block_hash | hex_string |
block_number | bigint |
transaction_index | bigint |
from_address | address |
to_address | address |
value | numeric |
gas | bigint |
gas_price | bigint |
input | hex_string |
Column | Type |
---|---|
token_address | address |
from_address | address |
to_address | address |
value | numeric |
transaction_hash | hex_string |
log_index | bigint |
block_number | bigint |
Column | Type |
---|---|
transaction_hash | hex_string |
transaction_index | bigint |
block_hash | hex_string |
block_number | bigint |
cumulative_gas_used | bigint |
gas_used | bigint |
contract_address | address |
root | hex_string |
status | bigint |
Column | Type |
---|---|
log_index | bigint |
transaction_hash | hex_string |
transaction_index | bigint |
block_hash | hex_string |
block_number | bigint |
address | address |
data | hex_string |
topics | string |
Column | Type |
---|---|
address | address |
bytecode | hex_string |
function_sighashes | string |
is_erc20 | boolean |
is_erc721 | boolean |
Column | Type |
---|---|
address | address |
symbol | string |
name | string |
decimals | bigint |
total_supply | numeric |
Column | Type |
---|---|
block_number | bigint |
transaction_hash | hex_string |
transaction_index | bigint |
from_address | address |
to_address | address |
value | numeric |
input | hex_string |
output | hex_string |
trace_type | string |
call_type | string |
reward_type | string |
gas | bigint |
gas_used | bigint |
subtraces | bigint |
trace_address | string |
error | string |
You can find column descriptions in https://github.com/medvedev1088/ethereum-etl-airflow
Note: for the address
type all hex characters are lower-cased.
boolean
type can have 2 values: True
or False
.
- In case the contract is a proxy, which forwards all calls to a delegate, interface detection doesn’t work,
which means
is_erc20
andis_erc721
will always be false for proxy contracts and they will be missing in thetokens
table. - The metadata methods (
symbol
,name
,decimals
,total_supply
) for ERC20 are optional, so around 10% of the contracts are missing this data. Also some contracts (EOS) implement these methods but with wrong return type, so the metadata columns are missing in this case as well. token_transfers.value
,tokens.decimals
andtokens.total_supply
have typeSTRING
in BigQuery tables, because numeric types there can't handle 32-byte integers. You should usecast(value as FLOAT64)
(possible loss of precision) orsafe_cast(value as NUMERIC)
(possible overflow) to convert to numbers.- The contracts that don't implement
decimals()
function but have the fallback function that returns aboolean
will have0
or1
in thedecimals
column in the CSVs.
-
Install python 3.5.3+ https://www.python.org/downloads/
-
You can use Infura if you don't need ERC20 transfers (Infura doesn't support eth_getFilterLogs JSON RPC method). For that use
-p https://mainnet.infura.io
option for the commands below. If you need ERC20 transfers or want to export the data ~40 times faster, you will need to set up a local Ethereum node: -
Install geth https://github.com/ethereum/go-ethereum/wiki/Installing-Geth
-
Start geth. Make sure it downloaded the blocks that you need by executing
eth.syncing
in the JS console. You can export blocks belowcurrentBlock
, there is no need to wait until the full sync as the state is not needed (unless you also need contracts bytecode and token details; for those you need to wait until the full sync). -
Install Ethereum ETL:
> pip3 install ethereum-etl
-
Export all:
> ethereumetl export_all --help > ethereumetl export_all -s 0 -e 5999999 -b 100000 -p file://$HOME/Library/Ethereum/geth.ipc -o output
In case
ethereumetl
command is not available in PATH, usepython3 -m ethereumetl
instead.The result will be in the
output
subdirectory, partitioned in Hive style:output/blocks/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv output/blocks/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv ... output/transactions/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv ... output/token_transfers/start_block=00000000/end_block=00099999/token_transfers_00000000_00099999.csv ...
Should work with geth and parity, on Linux, Mac, Windows.
If you use Parity you should disable warp mode with --no-warp
option because warp mode
does not place all of the block or receipt data into the database https://wiki.parity.io/Getting-Synced
If you see weird behavior, e.g. wrong number of rows in the CSV files or corrupted files, check out this issue: blockchain-etl#28
You can use AWS Auto Scaling and Data Pipeline to reduce the exporting time to a few hours. Read this article for details https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2
-
Install Docker https://docs.docker.com/install/
-
Build a docker image
> docker build -t ethereum-etl:latest . > docker image ls
-
Run a container out of the image
> docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest export_all -s 0 -e 5499999 -b 100000 -p https://mainnet.infura.io > docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest export_all -s 2018-01-01 -e 2018-01-01 -p https://mainnet.infura.io
- export_blocks_and_transactions
- export_token_transfers
- extract_token_transfers
- export_receipts_and_logs
- export_contracts
- export_tokens
- export_traces
- export_geth_traces
- extract_geth_traces
- get_block_range_for_date
- get_keccak_hash
All the commands accept -h
parameter for help, e.g.:
> ethereumetl export_blocks_and_transactions -h
Usage: ethereumetl export_blocks_and_transactions [OPTIONS]
Export blocks and transactions.
Options:
-s, --start-block INTEGER Start block
-e, --end-block INTEGER End block [required]
-b, --batch-size INTEGER The number of blocks to export at a time.
-p, --provider-uri TEXT The URI of the web3 provider e.g.
file://$HOME/Library/Ethereum/geth.ipc or
https://mainnet.infura.io
-w, --max-workers INTEGER The maximum number of workers.
--blocks-output TEXT The output file for blocks. If not provided
blocks will not be exported. Use "-" for stdout
--transactions-output TEXT The output file for transactions. If not
provided transactions will not be exported. Use
"-" for stdout
-h, --help Show this message and exit.
For the --output
parameters the supported types are csv and json. The format type is inferred from the output file name.
> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc \
--blocks-output blocks.csv --transactions-output transactions.csv
Omit --blocks-output
or --transactions-output
options if you want to export only transactions/blocks.
You can tune --batch-size
, --max-workers
for performance.
The API used in this command is not supported by Infura, so you will need a local node. If you want to use Infura for exporting ERC20 transfers refer to extract_token_transfers
> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output token_transfers.csv
Include --tokens <token1> --tokens <token2>
to filter only certain tokens, e.g.
> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv \
--tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 --tokens 0x06012c8cf97bead5deae237070f9587f8e7a266d
You can tune --batch-size
, --max-workers
for performance.
First extract transaction hashes from transactions.csv
(Exported with export_blocks_and_transactions):
> ethereumetl extract_csv_column --input transactions.csv --column hash --output transaction_hashes.txt
Then export receipts and logs:
> ethereumetl export_receipts_and_logs --transaction-hashes transaction_hashes.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv
Omit --receipts-output
or --logs-output
options if you want to export only logs/receipts.
You can tune --batch-size
, --max-workers
for performance.
Upvote this feature request openethereum/parity-ethereum#9075, it will make receipts and logs export much faster.
First export receipt logs with export_receipts_and_logs.
Then extract transfers from the logs.csv file:
> ethereumetl extract_token_transfers --logs logs.csv --output token_transfers.csv
You can tune --batch-size
, --max-workers
for performance.
First extract contract addresses from receipts.csv
(Exported with export_receipts_and_logs):
> ethereumetl extract_csv_column --input receipts.csv --column contract_address --output contract_addresses.txt
Then export contracts:
> ethereumetl export_contracts --contract-addresses contract_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output contracts.csv
You can tune --batch-size
, --max-workers
for performance.
First extract token addresses from contracts.json
(Exported with export_contracts):
> ethereumetl filter_items -i contracts.json -p "item['is_erc20'] or item['is_erc721']" | \
ethereumetl extract_field -f address -o token_addresses.txt
Then export ERC20 / ERC721 tokens:
> ethereumetl export_tokens --token-addresses token_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output tokens.csv
You can tune --max-workers
for performance.
Also called internal transactions.
The API used in this command is not supported by Infura,
so you will need a local Parity archive node (parity --tracing on
).
Make sure your node has at least 8GB of memory, or else you will face timeout errors.
See this issue
> ethereumetl export_traces --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/parity.ipc --batch-size 100 --output traces.csv
You can tune --batch-size
, --max-workers
for performance.
Read Differences between geth and parity traces.csv
The API used in this command is not supported by Infura,
so you will need a local Geth archive node (geth --gcmode archive --syncmode full --ipcapi debug
).
When using rpc, add --rpc --rpcapi debug
options.
> ethereumetl export_geth_traces --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output geth_traces.json
You can tune --batch-size
, --max-workers
for performance.
> ethereumetl extract_geth_traces --input geth_traces.json --output traces.csv
You can tune --batch-size
, --max-workers
for performance.
> ethereumetl get_block_range_for_date --provider-uri=https://mainnet.infura.io --date 2018-01-01
4832686,4838611
> ethereumetl get_keccak_hash -i "transfer(address,uint256)"
0xa9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b
> pip3 install -e .[dev]
> export ETHEREUM_ETL_RUN_SLOW_TESTS=True
> pytest -vv
> pip3 install tox
> tox
For getting ETC csv files, make sure you pass in the --chain classic
param where it's required for the scripts you want to export.
ETC won't run if your --provider-uri
is Infura. It will provide a warning and change the provider-uri to https://ethereumclassic.network
instead. For faster performance, run a client instead locally for classic such as parity chain=classic
and Geth-classic.
to_address
field differs forcallcode
trace (geth seems to return correct value, as parity value ofto_address
is same asto_address
of parent call);- geth output doesn't have
reward
traces; - geth output doesn't have
to_address
,from_address
,value
forsuicide
traces; error
field contains human readable error message, which might differ in geth/parity output;- geth output doesn't have
transaction_hash
; gas_used
is 0 on traces with error in geth, empty in parity;- zero output of subcalls is
0x000...
in geth,0x
in parity;
- Upload the files to S3:
> cd output
> aws s3 sync . s3://<your_bucket>/ethereumetl/export --region ap-southeast-1
-
Sign in to Athena https://console.aws.amazon.com/athena/home
-
Create a database:
CREATE DATABASE ethereumetl;
- Create the tables:
- blocks: schemas/aws/blocks.sql
- transactions: schemas/aws/transactions.sql
- token_transfers: schemas/aws/token_transfers.sql
- contracts: schemas/aws/contracts.sql
- receipts: schemas/aws/receipts.sql
- logs: schemas/aws/logs.sql
- tokens: schemas/aws/tokens.sql
Refer to https://github.com/medvedev1088/ethereum-etl-airflow for the instructions.
Read this article on how to convert CSVs to Parquet https://medium.com/@medvedev1088/converting-ethereum-etl-files-to-parquet-399e048ddd30
- Create the tables:
- parquet_blocks: schemas/aws/parquet/parquet_blocks.sql
- parquet_transactions: schemas/aws/parquet/parquet_transactions.sql
- parquet_token_transfers: schemas/aws/parquet/parquet_token_transfers.sql
Note that DECIMAL type is limited to 38 digits in Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal so values greater than 38 decimals will be null.
You can query the data that's updated daily in the public BigQuery dataset https://medium.com/@medvedev1088/ethereum-blockchain-on-google-bigquery-283fb300f579
Read this article https://medium.com/google-cloud/how-to-query-balances-for-all-ethereum-addresses-in-bigquery-fb594e4034a7
Read this article https://medium.com/google-cloud/building-token-recommender-in-google-cloud-platform-1be5a54698eb