# Usage

## Running Server

```bash
# Locally
minyma server run

# Docker Quick Start
make docker_build_local
docker run \
  -p 5000:5000 \
  -e OPENAI_API_KEY=`cat openai_key` \
  -e DATA_PATH=/data \
  -v ./data:/data \
  minyma:latest
```

The server will now be accessible at `http://localhost:5000`.
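
Once the server is up, a quick request confirms it is responding. The sketch below uses Python's standard library; the `/api/query` endpoint and its JSON payload shape are assumptions for illustration, so check Minyma's route definitions for the actual API.

```python
# Minimal smoke test against the running server. NOTE: the endpoint path
# and request/response shape are assumptions -- verify against minyma's
# actual routes before relying on this.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:5000/api/query",  # hypothetical endpoint
    data=json.dumps({"query": "What is CRISPR?"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:  # POST, since data is provided
    print(resp.status, resp.read().decode("utf-8"))
```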

## Normalizing & Loading Data

Minyma is designed to be extensible. You can add normalizers and vector DBs
using the appropriate interfaces defined in `./minyma/normalizer.py` and
`./minyma/vdb.py`. At the moment, the only supported database is `chroma`
and the only supported normalizer is the `pubmed` normalizer.
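
As a sketch of what a new normalizer might look like, here is a minimal example in Python. The class shape and `normalize` method below are assumptions for illustration; mirror the actual interface defined in `./minyma/normalizer.py`.

```python
# Hypothetical custom normalizer. The class and method names are
# assumptions -- follow the real interface in ./minyma/normalizer.py.
import json
from typing import Iterator


class MyDatasetNormalizer:
    """Streams a raw JSONL file and yields records shaped for the vector DB."""

    def __init__(self, filename: str):
        self.filename = filename

    def normalize(self) -> Iterator[dict]:
        with open(self.filename) as f:
            for i, line in enumerate(f):
                record = json.loads(line)
                # Reduce each raw record to an id plus the text to embed.
                yield {"id": str(i), "text": record.get("text", "")}
```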

To normalize data, you can use Minyma's `normalize` CLI command:

```bash
minyma normalize --filename ./pubmed_manuscripts.jsonl --normalizer pubmed --database chroma --datapath ./chroma
```

The above example does the following:

- Uses the `pubmed` normalizer
- Normalizes the `./pubmed_manuscripts.jsonl` raw dataset [0]
- Loads the output into a `chroma` database and persists the data to the `./chroma` directory

**NOTE:** The above dataset took about an hour to normalize on my MacBook Pro M2 Max.

[0] https://huggingface.co/datasets/TaylorAI/pubmed_author_manuscripts/tree/main
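
After loading completes, you can sanity-check the persisted store by querying it directly with the `chromadb` client. A minimal sketch, assuming the data lives in `./chroma`; the collection name is a placeholder, so inspect `./minyma/vdb.py` for the name Minyma actually registers.

```python
# Query the persisted chroma store directly. The collection name below is
# an assumption -- check minyma's vdb code for the real one.
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(name="pubmed")  # hypothetical name
results = collection.query(query_texts=["gene therapy outcomes"], n_results=3)
print(results["ids"], results["distances"])
```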

# Development

```bash
# Initialize the virtual environment
python3 -m venv venv
. ./venv/bin/activate

# Local Development
pip install -e .

# Creds
export OPENAI_API_KEY=`cat openai_key`
```

# Datasets

https://huggingface.co/datasets/TaylorAI/pubmed_author_manuscripts/tree/main

# Notes

- https://docs.pinecone.io/docs/openai
- https://docs.pinecone.io/docs/langchain
- https://docs.pinecone.io/docs/langchain#creating-embeddings
- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
- https://medium.com/@abhishekranjandev/building-a-speech-recognition-app-with-deepspeech-word2vec-and-pinecone-1e5907d103e2
- https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5
- https://cookbook.openai.com/examples/semantic_text_search_using_embeddings

TODO:

- Build this with Word2Vec / Doc2Vec: https://docs.pinecone.io/docs/openai
- https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py
- https://webcache.googleusercontent.com/search?q=cache:https://medium.com/@rubentak/unleashing-the-power-of-intelligent-chatbots-with-gpt-4-and-vector-databases-a-step-by-step-8027e2ce9e78