fastISM on DeepSEA Beluga¶
fastISM is a faster way to perform in-silico saturation mutagenesis. This tutorial uses the DeepSEA Beluga model (Zhou et al 2018), which predicts 2002 chromatin features for a 2000 bp input sequence. This tutorial covers the following:
- Installations and downloading required files for tutorial
- Benchmarking fastISM on the Beluga model against a standard ISM implementation
- Running fastISM on custom input sequences
- Visualizing fastISM output across all tasks (outputs)
- Selecting a task, visualizing the fastISM scores, and zooming in to visualize the underlying sequence features.
Installations and Data¶
We use pip
to install fastISM
and vizsequence
(to visualize sequence importance scores). In addition we download a trained Beluga model and a tsv with all the model’s outputs.
[4]:
# install fastISM
!pip install fastism
Collecting fastism
Downloading https://files.pythonhosted.org/packages/4f/93/26f83f7197d92b0c502d7f7af32cbd5e0d0f0b52a4bfb51b29162e860fc8/fastism-0.4.0-py3-none-any.whl
Requirement already satisfied: tensorflow<3.0.0,>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from fastism) (2.3.0)
Collecting pydot<2.0.0,>=1.4.1
Downloading https://files.pythonhosted.org/packages/33/d1/b1479a770f66d962f545c2101630ce1d5592d90cb4f083d38862e93d16d2/pydot-1.4.1-py2.py3-none-any.whl
Requirement already satisfied: numpy<1.19.0,>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (1.18.5)
Requirement already satisfied: wrapt>=1.11.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (1.12.1)
Requirement already satisfied: scipy==1.4.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (1.4.1)
Requirement already satisfied: grpcio>=1.8.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (1.32.0)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (1.15.0)
Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (1.1.2)
Requirement already satisfied: astunparse==1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (1.6.3)
Requirement already satisfied: absl-py>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (0.10.0)
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (0.35.1)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (3.3.0)
Requirement already satisfied: google-pasta>=0.1.8 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (0.2.0)
Requirement already satisfied: gast==0.3.3 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (0.3.3)
Requirement already satisfied: protobuf>=3.9.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (3.12.4)
Requirement already satisfied: tensorboard<3,>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (2.3.0)
Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (2.3.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (1.1.0)
Requirement already satisfied: h5py<2.11.0,>=2.10.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<3.0.0,>=2.3.0->fastism) (2.10.0)
Requirement already satisfied: pyparsing>=2.1.4 in /usr/local/lib/python3.6/dist-packages (from pydot<2.0.0,>=1.4.1->fastism) (2.4.7)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.9.2->tensorflow<3.0.0,>=2.3.0->fastism) (50.3.0)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (1.7.0)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (0.4.1)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (1.0.1)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (3.2.2)
Requirement already satisfied: google-auth<2,>=1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (1.17.2)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (2.23.0)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (1.3.0)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from markdown>=2.6.8->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (1.7.0)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (4.1.1)
Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3" in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (4.6)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (0.2.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (2020.6.20)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (3.1.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (3.1.0)
Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/lib/python3.6/dist-packages (from rsa<5,>=3.1.4; python_version >= "3"->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow<3.0.0,>=2.3.0->fastism) (0.4.8)
Installing collected packages: pydot, fastism
Found existing installation: pydot 1.3.0
Uninstalling pydot-1.3.0:
Successfully uninstalled pydot-1.3.0
Successfully installed fastism-0.4.0 pydot-1.4.1
[ ]:
#for visualizing the per-position importance
!pip install vizsequence
Collecting vizsequence
Downloading https://files.pythonhosted.org/packages/a6/10/b3b210eba27de588fba3c261b80317413e18ac3e371df9578b3cdc61096c/vizsequence-0.1.1.0.tar.gz
Requirement already satisfied: numpy>=1.9 in /usr/local/lib/python3.6/dist-packages (from vizsequence) (1.18.5)
Requirement already satisfied: matplotlib>=2.2.2 in /usr/local/lib/python3.6/dist-packages (from vizsequence) (3.2.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=2.2.2->vizsequence) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=2.2.2->vizsequence) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=2.2.2->vizsequence) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=2.2.2->vizsequence) (1.2.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from cycler>=0.10->matplotlib>=2.2.2->vizsequence) (1.15.0)
Building wheels for collected packages: vizsequence
Building wheel for vizsequence (setup.py) ... done
Created wheel for vizsequence: filename=vizsequence-0.1.1.0-cp36-none-any.whl size=3269 sha256=f5c7dd7708c4186bbdadb6e2e428969bec77fb97b43315a0eb60a4ddf1f6c62f
Stored in directory: /root/.cache/pip/wheels/08/eb/de/6b398b439ba39c278e5c341bdeed57d66280910e096496eaef
Successfully built vizsequence
Installing collected packages: vizsequence
Successfully installed vizsequence-0.1.1.0
[ ]:
# download trained model
! wget http://mitra.stanford.edu/kundaje/surag/fastISM/deepseabeluga_keras_nopermutelayer.h5 -O deepseabeluga.h5
--2020-09-20 06:25:21-- http://mitra.stanford.edu/kundaje/surag/fastISM/deepseabeluga_keras_nopermutelayer.h5
Resolving mitra.stanford.edu (mitra.stanford.edu)... 171.67.96.243
Connecting to mitra.stanford.edu (mitra.stanford.edu)|171.67.96.243|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 598186116 (570M)
Saving to: ‘deepseabeluga.h5’
deepseabeluga.h5 100%[===================>] 570.47M 48.5MB/s in 12s
2020-09-20 06:25:33 (46.9 MB/s) - ‘deepseabeluga.h5’ saved [598186116/598186116]
[ ]:
# download output annotation
! wget https://raw.githubusercontent.com/FunctionLab/ExPecto/20b99d1278678/resources/deepsea_beluga_2002_features.tsv -O outputs.tsv
--2020-09-20 06:25:35-- https://raw.githubusercontent.com/FunctionLab/ExPecto/20b99d1278678/resources/deepsea_beluga_2002_features.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203001 (198K) [text/plain]
Saving to: ‘outputs.tsv’
outputs.tsv 100%[===================>] 198.24K --.-KB/s in 0.04s
2020-09-20 06:25:36 (5.37 MB/s) - ‘outputs.tsv’ saved [203001/203001]
Init¶
[ ]:
import fastism
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
import vizsequence
# for some seaborn warnings
import warnings; warnings.simplefilter('ignore')
import time
[ ]:
tf.__version__
'2.3.0'
[5]:
! pip freeze | grep fastism
fastism==0.4.0
[1]:
!nvidia-smi
Wed Sep 23 09:42:18 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 34C P0 25W / 250W | 0MiB / 16280MiB | 0% Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[ ]:
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Num GPUs Available: 1
[ ]:
device = 'GPU:0' if tf.config.experimental.list_physical_devices('GPU') else '/device:CPU:0'
device
'GPU:0'
Load Model¶
[ ]:
model = tf.keras.models.load_model("deepseabeluga.h5")
WARNING:tensorflow:No training configuration found in the save file, so the model was *not* compiled. Compile it manually.
[ ]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_1 (Conv1D) (None, 1993, 320) 10560
_________________________________________________________________
activation_1 (Activation) (None, 1993, 320) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 1986, 320) 819520
_________________________________________________________________
activation_2 (Activation) (None, 1986, 320) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 1986, 320) 0
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 496, 320) 0
_________________________________________________________________
conv1d_3 (Conv1D) (None, 489, 480) 1229280
_________________________________________________________________
activation_3 (Activation) (None, 489, 480) 0
_________________________________________________________________
conv1d_4 (Conv1D) (None, 482, 480) 1843680
_________________________________________________________________
activation_4 (Activation) (None, 482, 480) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 482, 480) 0
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 120, 480) 0
_________________________________________________________________
conv1d_5 (Conv1D) (None, 113, 640) 2458240
_________________________________________________________________
activation_5 (Activation) (None, 113, 640) 0
_________________________________________________________________
conv1d_6 (Conv1D) (None, 106, 640) 3277440
_________________________________________________________________
activation_6 (Activation) (None, 106, 640) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 67840) 0
_________________________________________________________________
altereddense (Dense) (None, 2003) 135885523
_________________________________________________________________
activation_7 (Activation) (None, 2003) 0
_________________________________________________________________
dense_1 (Dense) (None, 2002) 4012008
_________________________________________________________________
activation_8 (Activation) (None, 2002) 0
=================================================================
Total params: 149,536,251
Trainable params: 149,536,251
Non-trainable params: 0
_________________________________________________________________
[ ]:
model.input.shape
TensorShape([None, 2000, 4])
[ ]:
# a look at the 2002 model outputs
outputs = pd.read_csv("./outputs.tsv", sep="\t")
outputs
Unnamed: 0 | Cell type | Assay | Treatment | Assay type | Source | Unnamed: 6 | |
---|---|---|---|---|---|---|---|
0 | 1 | 8988T | DNase | NaN | DNase | ENCODE | ./wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDuke... |
1 | 2 | AoSMC | DNase | NaN | DNase | ENCODE | ./wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDuke... |
2 | 3 | Chorion | DNase | NaN | DNase | ENCODE | ./wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDuke... |
3 | 4 | CLL | DNase | NaN | DNase | ENCODE | ./wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDuke... |
4 | 5 | Fibrobl | DNase | NaN | DNase | ENCODE | ./wgEncodeAwgDnaseUniform/wgEncodeAwgDnaseDuke... |
... | ... | ... | ... | ... | ... | ... | ... |
1997 | 1998 | Osteoblasts | H3K4me2 | NaN | Histone | Roadmap Epigenomics | E129-H3K4me2.narrowPeak.gz |
1998 | 1999 | Osteoblasts | H3K4me3 | NaN | Histone | Roadmap Epigenomics | E129-H3K4me3.narrowPeak.gz |
1999 | 2000 | Osteoblasts | H3K79me2 | NaN | Histone | Roadmap Epigenomics | E129-H3K79me2.narrowPeak.gz |
2000 | 2001 | Osteoblasts | H3K9me3 | NaN | Histone | Roadmap Epigenomics | E129-H3K9me3.narrowPeak.gz |
2001 | 2002 | Osteoblasts | H4K20me1 | NaN | Histone | Roadmap Epigenomics | E129-H4K20me1.narrowPeak.gz |
2002 rows × 7 columns
Benchmark¶
Calculate approximate speedup of fastISM vs a naive implementation.
[ ]:
def time_ism(ism_model, batch_sizes, seqlen):
times = []
per_100 = []
for b in batch_sizes:
# using randomly initialized tensors for benchmarking
x = np.random.random((b,seqlen,4))
x = tf.constant(x, dtype=ism_model.model.input.dtype)
t = time.time()
# each position is substituted with 0s for benchmarking
ism_model(x)
times.append(time.time()-t)
per_100.append((times[-1]/b)*100)
print("BATCH: {}\tTIME: {:.2f}\tPER 100: {:.2f}".format(b, times[-1], (times[-1]/b)*100))
print("BEST PER 100: {:.2f}".format(min(per_100)))
[ ]:
fast_ism_model = fastism.FastISM(model, test_correctness=False)
[ ]:
# 512 works with 16Gb GPU memory
time_ism(fast_ism_model, [128, 512], 2000)
BATCH: 128 TIME: 42.58 PER 100: 33.26
BATCH: 512 TIME: 85.09 PER 100: 16.62
BEST PER 100: 16.62
[ ]:
naive_ism_model = fastism.NaiveISM(model)
[ ]:
time_ism(naive_ism_model, [128, 512], 2000)
BATCH: 128 TIME: 142.26 PER 100: 111.14
BATCH: 512 TIME: 513.30 PER 100: 100.25
BEST PER 100: 100.25
Naive ISM takes around 100 seconds per 100 sequence, while fastISM takes around 16.6 seconds. That’s a speedup of about 6x!
Run on Custom Sequences¶
We run fastISM on selected custom sequences, check correctness against a standard implementation and visualize the fastISM importance score matrix across the top 200 outputs.
[ ]:
# chr3:138862274-138864274 (hg38)
chr3_enhancer = "CCGGTGATTTTCTGGAGTCTATATCCTTCATCAGATTTTCCAAGGGGTGTCTGTCCCCTCAAAAGAATGATTGTCATTATTTGAAAGACTAGTTCCAGACAGATATTTTATACAAATTTTCCCAGCATTGACATCCCTGAACCAAACTGTTTTTCTTCCCAACATTACTGTTTTCTTCCTTTCTGTCGAGTTTGTTGTTTTGTAATATCAGAATCTCCAGCTCACCTGAGTAAATGGTAACAAGGTGCCACCACCTTTGAATTCTCCCAGAATCCACCCCACCCTCCGTCAGAGCCACTGCCAAGGCACTCTTACTGATTTCTCCCACACTGCTGGCTCATTGCAAGTGGGAAGACAGCATGTGGAGTGGGTGTGCGGCTTATTAAAGTGAGAACTCAGGGTCAGGGCAGAACCAGGAAGAGAGCAGTGAGATATCCTGCTACCTAATCCAATTCTCCTTTTTGTGCATTTAGCACCCTCCCCTCCGCCTGCATAACAATGGAAGGAAAGAGGAAGTGGGAAAAAAGAAAGTCATGTAATTGAGTTAGAAGAGGTAATGACCAAGACCCTGGAGCAGAGGGAAAGCGGGTTACAAAAGGTGGGTTAAAGAAATCACAAGAGTATGAAGAGCTGGGAAATTACTAACAAATATTTGCTTGTGTGGGAAAGCAAAAAAGTAAAAACTTCAGTGCTGAATTGGGGCGCTGAGCCACCAGGGAAATTTGAGATTGGCATCAAGGACCGTGTTGAAGCAGGGTGGGCGGAGAAGGAGGGAAAACTACCAGCCAGCTGAGATTTTGCAGCTAGGCTGTGGCCTGATACCGAGTATCGATGCCGCAAGGGAGGGATGAGTCAGTCCTAGCACGTCCAAGTTTAGAATAATAGACTGTTTGCCACTGGGAAGGCAAACACCTTTCCTGTGAGAGGGCTTGCTGACAGTTCCAATGTCCAAAGTCCAATGCCGACCCAGAAAACTGAGGAGGCCCTGGCCCCTGCAGGAAGGGCTCATTTACATGGAGACTGAGTAAAGTGCTGTCTTAAACCCTCCTTCCTTCCCCCACTGGGAGGTTTCAGCCAGATATGCCACCCTTTGTAGGATTTCATAGGGTTGTCTAAAGCCAGGGTTGGCACAGAGCAGAAGCCACAGGGCTAAGTACCAGATTATAATTGTCAATGTCACACCTTACTGCAGAAGCCAGGGAAGGGAGCTAGGAAACTGAAGAGCTTTCTTGGTTATGGGCGGGGCTGTAAATGCAGAGTGTGCCCTGGTGACTCATGGGAGACAGTGAGAAACACTGTGGGGATCTGGTCAACCGGGTACTGATTCCTTTGAGGAAGGTATACTCCACATGCCAACCTGATACTCATGGCTAGTGAAGAGATGGCAGGATTGGGTTGCATCAGCCAGCCTAACTCGACTTGGAAACACAGAAAATAACCCAGAGCAGGTCTCAAGCACTGTGTAACTTTATTAGTTCATAGTGGCTGAACAGCCATGTTTAGGGCCTCTCAGAAGAAAGAGTTTCATCTTTGGGAAGAAATTTGTGTTGGGTGATTTTGTTCATATAATTTTGTGTTTTTTGTTTTGTTTTGGTGTTTGAGACAGGGCCTCACTCTCTCACACAGGCTGGAGTGCAGTGGCACCATCTTAGCTCACTGCAACCTCTACCTTCCTGCCTCAAGCGATCCTCCTACTTCAGCCTCCTGCATAGCTGGGACTACAGGCACGTATCACTCAACCCAGCTAATTTTTTTTTTTTCGAGATGCAGTCTTGCTCTGTCACCCAGGCTGGAGAGCAATGGCACTATCTTGGCTCACTGTAACCCCCGCCTCCCAGTCTCTGCCTCCTGAGTAGCTGGGATTACAGGCTCCTGCCACCACCCCCGGCTCAGCTAATTATTTCTTTCTTTCTTTTTTCTGAGATGAAGTTTCACTCTTGTTGCCCAGGCTGGAGTGCAATGGCACGATCTCAGCTCACTGCAATGTCTGCTTCTGGGGT"
# reproduce to have a "batch" of sequences
# you can add more custom sequences to this list
sequences = [chr3_enhancer]*5
#We define a function to do the one-hot encoding
onehot_mapping = {
'A': [1,0,0,0],
'C': [0,1,0,0],
'G': [0,0,1,0],
'T': [0,0,0,1],
'N': [0,0,0,0],
'a': [1,0,0,0],
'c': [0,1,0,0],
'g': [0,0,1,0],
't': [0,0,0,1],
}
def one_hot_encode(sequence):
return np.array([onehot_mapping[x] for x in sequence])
onehot_sequences = np.array([one_hot_encode(x) for x in sequences])
onehot_sequences.shape
(5, 2000, 4)
[ ]:
x = tf.constant(onehot_sequences, dtype=model.input.dtype)
mutations = [[1,0,0,0],
[0,1,0,0],
[0,0,1,0],
[0,0,0,1]]
# get outputs with all mutations (A/C/G/T)
# in general, fastISM gives the most speedup for larger batch sizes
# here we use a small batch size for illustration
fast_ism_out = [fast_ism_model(x, replace_with=mut) for mut in mutations]
naive_ism_out = [naive_ism_model(x, replace_with=mut) for mut in mutations]
[ ]:
# make into a batch_size x seqlen x num_outputs x 4 (bases)
fast_ism_out = np.transpose(np.array(fast_ism_out), (1,2,3,0))
naive_ism_out = np.transpose(np.array(naive_ism_out), (1,2,3,0))
print(naive_ism_out.shape)
(5, 2000, 2002, 4)
[ ]:
# check correctness
print(np.allclose(fast_ism_out, naive_ism_out, atol=1e-6))
print(np.allclose(naive_ism_out, fast_ism_out, atol=1e-6))
True
True
[ ]:
# reference output (should return using ISM itself)
ref = model(x, training=False).numpy()
print(ref.shape)
# repeat to make shape compatible with ism_out (for broadcasting)
ref = np.expand_dims(np.repeat(np.expand_dims(ref, 1), 2000, 1), -1)
print(ref.shape)
(5, 2002)
(5, 2000, 2002, 1)
[ ]:
mut_minus_ref = fast_ism_out - ref
# euclidean distance from reference for each output at each position
mut_minus_ref_euclidean = np.sqrt(np.sum(np.square(mut_minus_ref), -1)/3)
print(mut_minus_ref_euclidean.shape)
(5, 2000, 2002)
[ ]:
# pick index of sequence to explore
SEQ_IDX = 0
[ ]:
# heatmap of importance scores [tasks (clustered) x position]
# for top 200 tasks by reference predictions
top_tasks = np.argsort(ref[SEQ_IDX][0][:,0])[::-1][:200]
to_plot = mut_minus_ref_euclidean[SEQ_IDX][:, top_tasks].T
cg = sns.clustermap(to_plot,
row_cluster=True, # cluster tasks
col_cluster=False, # not position
# standard_scale=0,
dendrogram_ratio=0.04,
figsize=(40,8),
xticklabels=100,
vmax = np.quantile(to_plot, 0.998), # set max limit
cbar_pos=(0, .3, .02, .4))
cg.ax_row_dendrogram.set_visible(False)
Positions in the input have different importance scores depending on the task. For example, regions around 500 bp have high importance scores for the top few tasks but not for the rest of the tasks.
Sequence Features for a Single Output¶
Let’s pick a task and zoom in to visualize the underlying sequence with high importance scores for that task.
[ ]:
# pick task with highest prediction
TASK_IDX = np.argsort(ref[SEQ_IDX][0][:,0])[::-1][0]
# or pick another task of your choice
# TASK_IDX = 1777
TASK_IDX
1777
[ ]:
# predicted value for that task
ref[SEQ_IDX, 0, TASK_IDX, 0]
0.98828876
[ ]:
# output (task) description
outputs.iloc[TASK_IDX]
Unnamed: 0 1778
Cell type Small_Intestine
Assay DNase.hot
Treatment NaN
Assay type DNase
Source Roadmap Epigenomics
Unnamed: 6 E109-DNase.hot.bed.gz
Name: 1777, dtype: object
The output is DNase-seq prediction in small intestine.
[ ]:
# heatmap of base x position difference from ref
plt.figure(figsize=(40,5))
to_plot = mut_minus_ref[SEQ_IDX, :, TASK_IDX].T
sns.heatmap(to_plot,
center=0,
vmin = np.quantile(to_plot, 0.0001),
vmax = np.quantile(to_plot, 0.9999),
xticklabels = 100,
yticklabels = ['A','C','G','T'],
cbar_kws = dict(use_gridspec=False,location="top"))
plt.show()
plt.figure(figsize=(40,2))
plt.plot(mut_minus_ref_euclidean[SEQ_IDX, :, TASK_IDX])
plt.margins(0, 0)
plt.show()
Let’s zoom into the central bases and visualize the sequence as well.
[ ]:
# Zoom in heatmap of base x position
ZOOM_IN = range(800, 1000)
plt.figure(figsize=(40,5))
to_plot = mut_minus_ref[SEQ_IDX, ZOOM_IN, TASK_IDX].T
sns.heatmap(to_plot,
center=0,
vmin = np.quantile(to_plot, 0.0001),
vmax = np.quantile(to_plot, 0.9999),
xticklabels = 10,
yticklabels = ['A','C','G','T'],
cbar_kws = dict(use_gridspec=False,location="top"))
plt.show()
plt.figure(figsize=(40,2))
plt.plot(mut_minus_ref_euclidean[SEQ_IDX, ZOOM_IN, TASK_IDX])
plt.margins(0.01, 0.01)
plt.show()
vizsequence.plot_weights(
np.expand_dims(mut_minus_ref_euclidean[SEQ_IDX, ZOOM_IN, TASK_IDX], -1)*onehot_sequences[SEQ_IDX, ZOOM_IN],
figsize=(40,2))
On the left is an AP1 motif and on the right is HNF4 motif. HNF4 has a known role in intestinal development.