feat: Improving documentation
the-superpirate committed Sep 22, 2023
1 parent 3e6b8a4 commit ce42212
Showing 6 changed files with 79 additions and 105 deletions.
24 changes: 12 additions & 12 deletions cybrex/README.md
@@ -14,13 +14,13 @@ You should have [installed IPFS](http://standard-template-construct.org/#/help/i

Then, you should install cybrex package
```bash
pip install cybrex
ultranymous@nevermore:~ pip install cybrex
```

and launch qdrant database for storing vectors:

```bash
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
ultranymous@nevermore:~ docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

Upon its initial launch, `cybrex` will create a `~/.cybrex` directory containing a `config.yaml` file and a `chroma` directory.
@@ -33,23 +33,23 @@ STC contains metadata for the most of the items, but `links` or `content` fields

```console
# (Optional) Launch Summa search engine, then you will not have to wait bootstrapping every time.
# It will take a time!
# It will take a time! Wait until the text `Serving on ...` appears
# If you decided to launch it, switch to another Terminal window
geck --ipfs-http-base-url 127.0.0.1:8080 - serve
ultranymous@nevermore:~ geck --ipfs-http-base-url 127.0.0.1:8080 - serve
```

Now we should initialize Cybrex and choose which models will be used:

``` console
cybrex - write-config --force
```console
ultranymous@nevermore:~ cybrex - write-config --force
# or if you want to use OpenAI model, export keys and you should set appropriate models in config:
export OPENAI_API_KEY=...
cybrex - write-config -l openai --force
ultranymous@nevermore:~ export OPENAI_API_KEY=...
ultranymous@nevermore:~ cybrex - write-config -l openai --force
# or if you want to use GPU:
cybrex - write-config --device cuda --force
ultranymous@nevermore:~ cybrex - write-config --device cuda --force

# Summarize a document
cybrex - sum-doc doi:10.1155/2022/7138756
ultranymous@nevermore:~ cybrex - sum-doc doi:10.1155/2022/7138756

Document: doi:10.1155/2022/7138756
Summarization: Resveratrol is a natural compound found in various plants and has been studied for
@@ -64,7 +64,7 @@ to activate the host's immune defences, affect the TLRs/NF-κB signalling pathwa
viral gene expression.

# Question a document
cybrex - chat-doc doi:10.1155/2022/7138756 \
ultranymous@nevermore:~ cybrex - chat-doc doi:10.1155/2022/7138756 \
--query "What is the antivirus effect of resveratrol?"

Q: What is the antivirus effect of resveratrol?
@@ -78,7 +78,7 @@ syncytial virus (RSV) and to stimulate the secretion of higher levels of TNF-α,
and RSV clearance.

# Question entire science
cybrex - chat-sci "What is the antivirus effect of resveratrol?" --n-chunks 4 --n-documents 10
ultranymous@nevermore:~ cybrex - chat-sci "What is the antivirus effect of resveratrol?" --n-chunks 4 --n-documents 10

Q: What is the antivirus effect of resveratrol?
A: Resveratrol has been found to possess antiviral activity against a variety of viruses, including herpes simplex virus, human immunodeficiency virus, and hepatitis C virus. It has been shown to inhibit the replication of several viruses, including HIV, herpes simplex virus, and influenza virus, and to regulate TLR3 expression, thus affecting the recruitment of downstream related factors and finally affecting the regulation process of related signal pathways. It has also been studied for its antiviral activity against Reoviridae, and for its potential to inhibit Zika virus cytopathy effect. It has been active against Epstein virus, rotavirus, and vesicular stomatitis virus, and has been reported to alleviate virus-induced reproductive failure and to promote RSV clearance in the body more quickly.
97 changes: 55 additions & 42 deletions geck/README.md
@@ -1,7 +1,7 @@
# GECK (Garden of Eden Creation Kit)

GECK is a Python library and Bash tool for access STC - the large corpus of scholarly texts.
GECK includes embedded search engine [Summa](https://github.com/izihawa/summa), helps to feed it with prepared IPFS-based databases, do search queries over these databases and iterate over all documents if you need.
GECK is a Python library and Bash tool for deploying and accessing STC - the large corpus of scholarly texts.
GECK includes embedded search engine [Summa](https://github.com/izihawa/summa), helps to feed it with a prepared IPFS-based database of scholarly texts, do search queries over the database and iterate over all documents if you need.

## Install

@@ -20,14 +20,14 @@ STC contains metadata for the most of the items, but `links` or `content` fields
### CLI

```console
# (Optional) Launch Summa search engine, then you will not have to wait bootstrapping every time.
# It will take a time!
# (Optional) Launch standalone Summa search engine, then you will not have to wait bootstrapping every time.
# It will take a time! Wait until the text `Serving on ...` appears
# If you decided to launch it, switch to another Terminal window
geck - serve
ultranymous@nevermore:~ geck - serve
INFO: Serving on 127.0.0.1:10082

# Iterate over all stored documents
geck - documents

ultranymous@nevermore:~ geck - documents
INFO: Setting up indices...
{"authors":[{"family":"Manresa Presasa","given":"JM","sequence":"first"},{"family":"Rebull Fatsinib","given":"J","sequence":"additional"},{"family":"Miravalls Figuerolac","given":"M","sequence":"additional"},{"family":"Caballol Angelatsd","given":"R","sequence":"additional"},{"family":"Minué Magañae","given":"P","sequence":"additional"},{"family":"Juan Franquetf","given":"R","sequence":"additional"}],"ctr":0.1,"custom_score":1.0,"doi":"10.1157/13053458","issued_at":7376313600,"language":"es","metadata":{"container_title":"Atención Primaria","first_page":435,"issns":["0212-6567","1578-1275"],"issue":"7","last_page":436,"publisher":"Elsevier BV","volume":"32"},"page_rank":0.16246586,"referenced_by_count":5,"tags":["Family Practice","General Medicine"],"title":"La espirometría en el diagnóstico de la enfermedad pulmonar obstructiva crónica en atención primaria","type":"journal-article","updated_at":1687530735}
{"authors":[{"family":"Yanes Baonza","given":"M","sequence":"first"},{"family":"Ferrer García-Borrás","given":"JM","sequence":"additional"},{"family":"Cabrera Majada","given":"A","sequence":"additional"},{"family":"Sánchez González","given":"R","sequence":"additional"}],"ctr":0.1,"custom_score":1.0,"doi":"10.1157/13053456","issued_at":7376313600,"language":"es","metadata":{"container_title":"Atención Primaria","first_page":438,"issns":["0212-6567","1578-1275"],"issue":"7","last_page":438,"publisher":"Elsevier BV","volume":"32"},"page_rank":0.15,"referenced_by_count":0,"tags":["Family Practice","General Medicine"],"title":"Sonambulismo asociado con zolpidem","type":"journal-article","updated_at":1687530735}
@@ -38,60 +38,73 @@ INFO: Setting up indices...
{"authors":[{"family":"Carmona Ibáñez","given":"G","sequence":"first"},{"family":"Guevara Serrano","given":"J","sequence":"additional"}],"ctr":0.1,"custom_score":1.0,"doi":"10.1157/13053464","issued_at":7376313600,"language":"es","metadata":{"container_title":"Atención Primaria","first_page":415,"issns":["0212-6567","1578-1275"],"issue":"7","last_page":419,"publisher":"Elsevier BV","volume":"32"},"page_rank":0.15,"referenced_by_count":0,"tags":["Family Practice","General Medicine"],"title":"Estudio de la marca en la prescripción de genéricos en 6 centros de salud durante el año 2001","type":"journal-article","updated_at":1687530735}

# Do a match search by field
geck - search doi:10.3384/ecp1392a41

ultranymous@nevermore:~ geck - search doi:10.3384/ecp1392a41
INFO: Setting up indices...
INFO: Searching doi:10.3384/ecp1392a41...
{"abstract": "In recent years, water hydraulics has been getting more <...> "type": "proceedings-article", "updated_at": 1687530737}

# Do a match search by word. In the example below documents are cut for displaying reason
geck - search hemoglobin --limit 3

ultranymous@nevermore:~ geck - search hemoglobin --limit 3
INFO: Setting up indices...
INFO: Searching hemoglobin...
{"abstract": "Abstract\nWe exa <...>
{"abstract": "Abstract\nUsing a <...>
{"abstract": "Regional cerebral <...>
```

You can add `--debug` flag after `geck` to enable debugging output.
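
The `geck - documents` and `geck - search` output shown above is one JSON object per line, mixed with `INFO:` log lines, and some sample lines are cut with `<...>` for display. A minimal sketch of consuming such output from Python, assuming it has already been captured into a string; the `parse_geck_output` helper is hypothetical and not part of GECK:

```python
import json
from typing import Iterator


def parse_geck_output(text: str) -> Iterator[dict]:
    """Yield one dict per JSON line, skipping log lines and blanks."""
    for line in text.splitlines():
        line = line.strip()
        # Log lines such as `INFO: Setting up indices...` are not JSON objects.
        if not line.startswith('{'):
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Truncated or otherwise malformed lines are skipped, not fatal.
            continue


sample = '\n'.join([
    'INFO: Setting up indices...',
    '{"doi": "10.1157/13053456", "language": "es", "title": "Sonambulismo asociado con zolpidem"}',
    '{"abstract": "Abstract\\nWe exa <...>',
])
documents = list(parse_geck_output(sample))
for document in documents:
    print(document['doi'], '|', document['title'])
```

Only the complete JSON line survives parsing here; the log line and the truncated line are dropped.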

### Python

```python
import json
import argparse
import asyncio

from stc_geck.advices import format_document
from stc_geck.client import StcGeck

DEFAULT_LIMIT = 5


async def main(limit: int):
    geck = StcGeck(
        ipfs_http_base_url='http://127.0.0.1:8080',
        timeout=300,
    )

    # Connects to IPFS and instantiate configured indices for searching
    # It will take a time depending on your IPFS performance
    await geck.start()

    # GECK encapsulates Python client to Summa.
    # It can be either external stand-alone server or embed server,
    # but details are hidden behind `SummaClient` interface.
    summa_client = geck.get_summa_client()

    # Match search returns top-5 documents which contain `additive manufacturing` in their title, abstract or content.
    documents = await summa_client.search_documents({
        "index_alias": "nexus_science",
        "query": {
            "match": {
                "value": "additive manufacturing",
                "query_parser_config": {"default_fields": ["abstract", "title", "content"]}
            }
        },
        "collectors": [{"top_docs": {"limit": limit}}],
        "is_fieldnorms_scoring_enabled": False,
    })

    for document in documents:
        print(format_document(document) + '\n')

    await geck.stop()

if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('--limit', type=int, default=DEFAULT_LIMIT)
    args = argparser.parse_args()

geck = StcGeck(
    ipfs_http_base_url='http://127.0.0.1:8080',
    timeout=300,
)

# Connects to IPFS and instantiate configured indices for searching It will take a time depending on your IPFS performance
await geck.start()

# GECK encapsulates Python client to Summa. It can be either external stand-alone server or embed server, but details are hidden behind SummaClient interface.
summa_client = geck.get_summa_client()

# Match search returns top-5 documents which contain `additive manufacturing` in their title, abstract or content.
search_response = await summa_client.search({
    "index_alias": "nexus_science",
    "query": {
        "match": {
            "value": "additive manufacturing",
            "query_parser_config": {"default_fields": ["abstract", "title", "content"]}
        }
    },
    "collectors": [{"top_docs": {"limit": 5}}],
    "is_fieldnorms_scoring_enabled": False,
})
for scored_document in search_response.collector_outputs[0].documents.scored_documents:
    document = json.loads(scored_document.document)
    print('DOI:', document['doi'])
    print('Title:', document['title'])
    print('Abstract:', document.get('abstract'))
    print('Links:', document.get('links'))
    print('-----')
    asyncio.run(main(args.limit))
```

More example for Python can be found in [examples directory](/geck/examples/search-stc.ipynb)
48 changes: 0 additions & 48 deletions geck/examples/search-stc.py

This file was deleted.

2 changes: 1 addition & 1 deletion geck/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "stc-geck"
version = "1.8.10"
version = "1.8.14"
authors = [{ name = "Interdimensional Walker" }]
description = "GECK (Garden Of Eden Creation Kit) is a toolkit for setting up and maintaining STC"
readme = "README.md"
8 changes: 7 additions & 1 deletion geck/stc_geck/cli.py
@@ -25,7 +25,11 @@ async def wrapper_func(self, *args, **kwargs):
except IpfsConnectionError as e:
print(
f"{colored('ERROR', 'red')}: Cannot connect to IPFS: {e.info}\n"
f"{colored('HINT', 'yellow')}: Try to pass working IPFS address with `--ipfs-http-base-url` parameter",
f"{colored('HINT', 'yellow')}: Install IPFS to your computer: "
f"https://docs.ipfs.tech/install/ipfs-desktop/\n"
f"{colored('HINT', 'yellow')}: Also, ensure IPFS is launched\n"
f"{colored('HINT', 'yellow')}: Otherwise, you can pass IPFS address of the working instance with "
f"`--ipfs-http-base-url` parameter: `geck --ipfs-http-base-url http://127.0.0.1:8080 - serve`",
file=sys.stderr,
)
finally:
@@ -129,6 +133,7 @@ async def random_cids(self, n: Optional[int] = None, space: Optional[str] = None
async with self.geck as geck:
return await geck.random_cids(n=n)

@exception_handler
async def search(self, query: str, limit: int = 1, offset: int = 0):
"""
Searches in STC using default Summa match queries.
@@ -153,6 +158,7 @@ async def search(self, query: str, limit: int = 1, offset: int = 0):
}
return await summa_client.search_documents(query)

@exception_handler
async def serve(self):
"""
Start serving Summa
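
The hunks above show only fragments of the `exception_handler` decorator that this commit applies to `search` and `serve`. A minimal sketch of how such a decorator for async methods could be put together, under stated assumptions: `IpfsConnectionError` below is a stand-in class exposing only the `info` attribute seen in the diff, the `finally:` branch of the original is omitted because its body is not visible, and colouring is left out.

```python
import asyncio
import functools
import sys


class IpfsConnectionError(Exception):
    """Stand-in for the real exception; only the `info` attribute is assumed."""

    def __init__(self, info: str):
        super().__init__(info)
        self.info = info


def exception_handler(func):
    """Wrap an async method so connection failures print hints instead of a traceback."""

    @functools.wraps(func)
    async def wrapper_func(self, *args, **kwargs):
        try:
            return await func(self, *args, **kwargs)
        except IpfsConnectionError as e:
            print(
                f"ERROR: Cannot connect to IPFS: {e.info}\n"
                "HINT: Ensure IPFS is installed and launched, or pass "
                "`--ipfs-http-base-url` with the address of a working instance",
                file=sys.stderr,
            )
            return None

    return wrapper_func


class DemoCli:
    """Hypothetical CLI class used only to exercise the decorator."""

    @exception_handler
    async def search(self, query: str):
        # Simulate an unreachable IPFS node.
        raise IpfsConnectionError('connection refused')

    @exception_handler
    async def echo(self, value: str):
        return value


result_failed = asyncio.run(DemoCli().search('hemoglobin'))
result_ok = asyncio.run(DemoCli().echo('ok'))
```

Using `functools.wraps` keeps the wrapped method's name and docstring, which matters when a CLI framework builds help text from them.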
5 changes: 4 additions & 1 deletion geck/stc_geck/client.py
@@ -103,6 +103,7 @@ def __init__(
grpc_api_endpoint: str = '127.0.0.1:10082',
index_alias: str = 'nexus_science',
timeout: int = 300,
default_cache_size: int = 300,
):
"""
Constructs GECK that may be used to access STC dataset.
@@ -114,6 +115,7 @@ def __init__(
endpoint for setting up Summa. If there is Summa listening on the port before launching, then
GECK uses existing instance otherwise launches its own one
:param timeout: timeout for requests sent to IPFS
:param default_cache_size: the CachingDirectory size in bytes
"""
super().__init__()
self.ipfs_http_base_url = canonoize_base_url(ipfs_http_base_url)
@@ -124,6 +126,7 @@ def __init__(
self.ipfs_data_directory = '/' + ipfs_data_directory.strip('/') + '/'
self.grpc_api_endpoint = grpc_api_endpoint
self.index_alias = index_alias
self.default_cache_size = default_cache_size
self.temp_dir = tempfile.TemporaryDirectory()

self.is_embed = not is_endpoint_listening(self.grpc_api_endpoint)
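
The constructor decides between an external and an embedded Summa by calling `is_endpoint_listening`, whose body is not part of this diff. One plausible way to write such a check, offered as an assumption rather than the library's actual code, is a short TCP connection attempt against the `host:port` endpoint:

```python
import socket


def is_endpoint_listening(endpoint: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to `host:port` succeeds (hypothetical implementation)."""
    host, _, port = endpoint.rpartition(':')
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False


# Demonstration: open a listener on an OS-assigned port, then probe it.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(1)
port = server.getsockname()[1]

listening_while_open = is_endpoint_listening(f'127.0.0.1:{port}')
server.close()
listening_after_close = is_endpoint_listening(f'127.0.0.1:{port}')

# Mirrors the constructor: embed a server only when nothing is listening.
is_embed = not listening_after_close
```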
@@ -146,7 +149,7 @@ async def start(self):
'method': 'GET',
'url_template': f'{full_path}{{file_name}}',
'headers_template': headers_template,
'cache_config': {'cache_size': 536870912},
'cache_config': {'cache_size': self.default_cache_size},
}}
logging.getLogger('info').info({'action': 'launching_embedded', 'remote_index_config': remote_index_config})
try:
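
This hunk replaces the hard-coded `cache_size` of `536870912` with the new `default_cache_size` parameter, whose default is `300` and whose docstring gives the unit as bytes. The arithmetic below puts the two values side by side; `mib_to_bytes` is a hypothetical helper, not something GECK provides.

```python
MIB = 1024 * 1024


def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes."""
    return mib * MIB


previous_cache_size = 536870912  # value hard-coded before this commit
new_default_cache_size = 300     # default of `default_cache_size` after this commit

# The former value is exactly 512 MiB.
print(previous_cache_size == mib_to_bytes(512))
# Ratio between the former value and the new default.
print(previous_cache_size // new_default_cache_size)

# Passing the former size explicitly would look like this (left as a comment
# because constructing the client requires a reachable IPFS node):
# geck = StcGeck(ipfs_http_base_url='http://127.0.0.1:8080', default_cache_size=mib_to_bytes(512))
```

Given the stated unit, callers who want the former capacity would need to pass it explicitly, as in the commented call.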
