forked from ICIJ/datashare

Better analyze information, in all its forms

Datashare

Download

https://datashare.icij.org/

Documentation

Datashare's user guide can be found here: https://icij.gitbook.io/datashare/

Description

Datashare is a free, open-source desktop application developed by the non-profit International Consortium of Investigative Journalists (ICIJ).

Datashare allows investigative journalists to:

  • access all their documents in one place, locally on their computer, while securing them from potential third-party interference
  • search PDFs, images, text files, spreadsheets, slides and any other files simultaneously
  • automatically detect and filter by people, organizations and locations

Installing and using

Using with Elasticsearch

You can download the script at datashare.icij.org.

To access the web GUI, go to your documents folder, launch path/to/datashare.sh, then open http://localhost:8080 in your browser.

Using only Named Entity Recognition

You can use the Datashare Docker container solely for its HTTP-exposed name-finding API.

Just run:

docker run -ti -p 8080:8080 -v /path/to/dist/:/home/datashare/dist icij/datashare:0.10 -m NER

A bit of explanation:

  • -w tells datashare to run the web server. It listens on port 8080, which is why that port is mapped for Docker
  • -m NER runs datashare in a stateless mode, without any index
  • -v /path/to/dist:/home/datashare/dist maps the directory from which the NLP models are read (and into which they are downloaded if they don't exist)
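As a rough sketch of how these options fit together, the command can be wrapped in a small shell function. The function name `run_ner`, the optional host-port argument and the `DOCKER` override variable are illustrative assumptions, not part of Datashare; only the docker arguments themselves come from the command above.

```shell
#!/bin/sh
# run_ner DIST_DIR [HOST_PORT]
# Hypothetical wrapper around the docker command shown above.
#   DIST_DIR  : host directory holding (or receiving) the NLP models
#   HOST_PORT : port on the host, mapped to the container's 8080 (default 8080)
# Setting DOCKER=echo prints the arguments instead of starting a container.
run_ner() {
    dist_dir=$1
    host_port=${2:-8080}
    ${DOCKER:-docker} run -ti \
        -p "$host_port:8080" \
        -v "$dist_dir:/home/datashare/dist" \
        icij/datashare:0.10 -m NER
}
```

For instance, `DOCKER=echo run_ner /path/to/dist 9090` would only print the arguments, with the host port 9090 mapped to the container's 8080.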

Then query the server with curl:

curl -i localhost:8080/ner/findNames/CORENLP --data-binary @path/to/a/file.txt

The last path segment (CORENLP) selects the NLP framework. You can choose among CORENLP, IXAPIPE, MITIE or OPENNLP.
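To compare the frameworks on the same input, the request can be repeated for each of them. The following is a minimal sketch: the function names and the `DATASHARE_URL` variable are assumptions made for illustration, and the response format is not described here, so the output is printed as-is.

```shell
#!/bin/sh
# ner_url FRAMEWORK
# Build the findNames URL for one framework. DATASHARE_URL is a hypothetical
# override; it defaults to the address used in the curl example above.
ner_url() {
    printf '%s/ner/findNames/%s\n' "${DATASHARE_URL:-http://localhost:8080}" "$1"
}

# find_names_all FILE
# Send FILE to each of the four frameworks in turn and print the raw responses.
find_names_all() {
    for framework in CORENLP IXAPIPE MITIE OPENNLP; do
        echo "== $framework =="
        curl -s "$(ner_url "$framework")" --data-binary "@$1"
        echo
    done
}
```

With the container running, `find_names_all path/to/a/file.txt` sends the same file to every framework.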

Extract Text from Files

Implementations

Support

Tika File Formats

Extract Persons, Organizations or Locations from Text

Implementations

  • org.icij.datashare.text.nlp.corenlp.CorenlpPipeline

    Stanford CoreNLP v3.8.0, (Conditional Random Fields), Composite GPL v3+

  • org.icij.datashare.text.nlp.ixapipe.IxapipePipeline

    Ixa Pipes Nerc v1.6.1, (Perceptron), Apache License v2.0

  • org.icij.datashare.text.nlp.mitie.MitiePipeline

    MIT Information Extraction v0.8, (Structural Support Vector Machines), Boost Software License v1.0

  • org.icij.datashare.text.nlp.opennlp.OpennlpPipeline

    Apache OpenNLP v1.6.0, (Maximum Entropy), Apache License v2.0

Natural Language Processing Stages Support

NlpStage
TOKEN
SENTENCE
POS
NER

Named Entity Recognition Language Support

NlpStage.NER                ENGLISH  SPANISH  GERMAN  FRENCH   CHINESE
NlpPipeline.Type.CORENLP    X        X        X       (w/ EN)  X
NlpPipeline.Type.OPENNLP    X        X        -       X        -
NlpPipeline.Type.IXAPIPE    X        X        X       -        -
NlpPipeline.Type.MITIE      X        X        X       -        -
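A script that calls the name-finding API may want to pick a framework by language first. The sketch below simply transcribes the table above into a shell lookup. The function names are hypothetical, and counting the "(w/ EN)" cell for CORENLP and FRENCH as supported is an assumption.

```shell
#!/bin/sh
# ner_supports PIPELINE LANGUAGE
# Exit status 0 if the table above marks the combination as supported, 1 otherwise.
# CORENLP:FRENCH is listed as "(w/ EN)" in the table and is counted as supported
# here (an assumption).
ner_supports() {
    case "$1:$2" in
        CORENLP:ENGLISH|CORENLP:SPANISH|CORENLP:GERMAN|CORENLP:FRENCH|CORENLP:CHINESE) return 0 ;;
        OPENNLP:ENGLISH|OPENNLP:SPANISH|OPENNLP:FRENCH) return 0 ;;
        IXAPIPE:ENGLISH|IXAPIPE:SPANISH|IXAPIPE:GERMAN) return 0 ;;
        MITIE:ENGLISH|MITIE:SPANISH|MITIE:GERMAN) return 0 ;;
        *) return 1 ;;
    esac
}

# pipelines_for LANGUAGE
# Print, one per line, every pipeline the table lists for LANGUAGE.
pipelines_for() {
    for p in CORENLP OPENNLP IXAPIPE MITIE; do
        if ner_supports "$p" "$1"; then
            printf '%s\n' "$p"
        fi
    done
}
```

For example, `pipelines_for GERMAN` lists CORENLP, IXAPIPE and MITIE, matching the GERMAN column.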

Named Entity Categories Support

NamedEntity.Category
ORGANIZATION
PERSON
LOCATION

Parts-of-Speech Language Support

NlpStage.POS                ENGLISH  SPANISH  GERMAN  FRENCH
NlpPipeline.Type.CORENLP    X        X        X       X
NlpPipeline.Type.OPENNLP    X        X        X       X
NlpPipeline.Type.IXAPIPE    X        X        X       X
NlpPipeline.Type.MITIE      -        -        -       -

Store and Search Documents and Named Entities

Implementations

  • org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer

    Elasticsearch v6.1.0, Apache License v2.0

Compilation / Build

Requires JDK 8 and Maven 3.

From the datashare root directory, run: mvn package
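The build step can be guarded by a quick check that both tools are on the PATH. This is only a sketch: the function names are hypothetical, and it verifies that `java` and `mvn` exist without checking that their versions are JDK 8 and Maven 3.

```shell
#!/bin/sh
# have_tool NAME
# Succeed if NAME can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# build_datashare DIR
# Check that java and mvn are available, then run `mvn package` inside DIR.
# Versions are not checked (JDK 8 and Maven 3 are still required).
build_datashare() {
    for tool in java mvn; do
        if ! have_tool "$tool"; then
            echo "missing required tool: $tool" >&2
            return 1
        fi
    done
    (cd "$1" && mvn package)
}
```

Running `build_datashare .` from the datashare root directory is then equivalent to the command above, with an earlier and clearer failure when a tool is missing.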

License

Datashare is released under the GNU Affero General Public License.

Feedback

We welcome feedback as well as contributions!

For any bug, question, comment or (pull) request, please contact us at [email protected]
