Ebook Embeddings Search - Colaboratory

Notes on usage:

Make sure to change runtime to GPU.
Upload an epub file representing the ebook you want to search (tip: ever heard of libgen?).
Re-run the last cell using different queries to keep searching the same book.

Optional:

Embeddings for the book you upload will be saved in Files (in the left menu bar) under the title 'embeddings-{first chapter}-{last chapter}-{epub filename}.json'.

Download this file and upload it (instead of an epub) on your next runtime session in order to avoid generating the embeddings again.

Run 'process_file' with 'preview_mode' set to True first to check which range of chapters you want to index. This helps you avoid needlessly creating embeddings for chapters like 'Notes' and 'Works Cited'.
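The preview-then-index workflow can be sketched on toy data. The snippet below is only an illustration, not the notebook's own code: `select_chapters` is a hypothetical helper that mirrors the clamped, inclusive slice applied to `first_chapter` and `last_chapter`, and the chapter titles are made up.

```python
import math

def select_chapters(chapters, first_chapter=0, last_chapter=math.inf):
    """Return chapters[first_chapter..last_chapter], inclusive on both ends."""
    # Clamp the upper bound so math.inf (the default) means "through the last chapter".
    last = min(last_chapter, len(chapters) - 1)
    return chapters[first_chapter:last + 1]

# Made-up titles standing in for what a preview run might list.
toc = ['Prologue', 'One', 'Two', 'Notes', 'Works Cited']
for i, title in enumerate(toc):
    print(i, title)

# After reading the preview, keep only chapters 1-2 and skip the back matter.
print(select_chapters(toc, first_chapter=1, last_chapter=2))
```

In the notebook itself, the equivalent step would be one call with 'preview_mode' set to True, followed by a second call that passes the chosen 'first_chapter' and 'last_chapter'.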

!pip install -q ebooklib sentence_transformers

from sentence_transformers import SentenceTransformer, util
import json
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
from os.path import exists
from IPython.display import HTML, display
import numpy as np
import math
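The cells below call 'get_embeddings', whose definition does not appear in this section. As a rough sketch of what such a helper might look like, assuming it wraps `SentenceTransformer.encode`: the name `embed_texts`, the explicit `model` parameter, and the model name in the usage comment are all assumptions, not the notebook's actual choices.

```python
import numpy as np

def embed_texts(texts, model):
    """Encode a string or a list of strings into a 2-D array, one row per text.

    `model` is assumed to expose an `encode` method compatible with
    sentence_transformers.SentenceTransformer.encode.
    """
    # Wrap a single query so the result is always 2-D; callers can take row [0].
    if isinstance(texts, str):
        texts = [texts]
    return np.asarray(model.encode(texts, convert_to_numpy=True))

# Possible usage (downloads a model; the model name is an assumption):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')
# query_embedding = embed_texts('climate change', model)[0]
```

Always returning a 2-D array is consistent with how 'search' indexes the result with '[0]' for a single query.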


def part_to_chapter(part):
    # Parse one EPUB document item and keep only its non-empty <p> paragraphs.
    soup = BeautifulSoup(part.get_body_content(), 'html.parser')
    paragraphs = [para.get_text().strip() for para in soup.find_all('p')]
    paragraphs = [para for para in paragraphs if len(para) > 0]
    if len(paragraphs) == 0:
        return None
    # Use the joined <h1> headings as the chapter title (may be empty).
    title = ' '.join([heading.get_text() for heading in soup.find_all('h1')])
    return {'title': title, 'paras': paragraphs}

min_words_per_para = 150
max_words_per_para = 500

def format_paras(chapters):
    for i in range(len(chapters)):
        for j in range(len(chapters[i]['paras'])):
            split_para = chapters[i]['paras'][j].split()
            # Split an over-long paragraph: keep the first max_words_per_para words
            # here and push the remainder into a new paragraph right after it.
            if len(split_para) > max_words_per_para:
                chapters[i]['paras'].insert(j + 1, ' '.join(split_para[max_words_per_para:]))
                chapters[i]['paras'][j] = ' '.join(split_para[:max_words_per_para])
            # Merge following paragraphs into a too-short one, blanking each merged source.
            k = j
            while len(chapters[i]['paras'][j].split()) < min_words_per_para and k < len(chapters[i]['paras']) - 1:
                chapters[i]['paras'][j] += '\n' + chapters[i]['paras'][k + 1]
                chapters[i]['paras'][k + 1] = ''
                k += 1

        # Drop the paragraphs that were blanked by merging.
        chapters[i]['paras'] = [para.strip() for para in chapters[i]['paras'] if len(para.strip()) > 0]
        if len(chapters[i]['title']) == 0:
            chapters[i]['title'] = '(Unnamed) Chapter {no}'.format(no=i + 1)
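The split-then-merge idea behind 'format_paras' can be seen more easily on tiny inputs. The following is a simplified sketch, not the notebook's algorithm: `rechunk` is a hypothetical pure function that first cuts every paragraph into pieces of at most `max_words` words and then greedily merges consecutive pieces until each chunk has at least `min_words` words. It joins words with single spaces, so original line breaks are lost, and a merged chunk can exceed `max_words`.

```python
def rechunk(paras, min_words=150, max_words=500):
    """Return passages of roughly min_words..max_words words each."""
    # Step 1: cut every paragraph into pieces of at most max_words words.
    pieces = []
    for para in paras:
        words = para.split()
        for start in range(0, len(words), max_words):
            pieces.append(words[start:start + max_words])

    # Step 2: accumulate consecutive pieces until the minimum size is reached.
    chunks, current = [], []
    for piece in pieces:
        current.extend(piece)
        if len(current) >= min_words:
            chunks.append(' '.join(current))
            current = []
    # Keep any short remainder as a final chunk instead of dropping it.
    if current:
        chunks.append(' '.join(current))
    return chunks

# Tiny thresholds make the behaviour easy to follow by hand.
print(rechunk(['a b', 'c', 'd e f g h', 'i'], min_words=3, max_words=4))
```

Returning a new list rather than editing the list in place avoids having to reason about indices shifting while the loop runs.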

def print_previews(chapters):
    for (i, chapter) in enumerate(chapters):
        title = chapter['title']
        # split() with no argument also splits on the newlines added when paragraphs are merged.
        wc = len(' '.join(chapter['paras']).split())
        paras = len(chapter['paras'])
        initial = chapter['paras'][0][:20]
        preview = '{}: {} | wc: {} | paras: {}\n"{}..."\n'.format(i, title, wc, paras, initial)
        print(preview)

def get_chapters(book_path, print_chapter_previews, first_chapter, last_chapter):
    book = epub.read_epub(book_path)
    parts = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
    # Parse each part once and drop parts without any paragraph text.
    chapters = [chapter for chapter in map(part_to_chapter, parts) if chapter is not None]
    # Clamp the (possibly infinite) upper bound to the last available chapter.
    last_chapter = min(last_chapter, len(chapters) - 1)
    chapters = chapters[first_chapter:last_chapter + 1]
    format_paras(chapters)
    if print_chapter_previews:
        print_previews(chapters)
    return chapters

def read_json(json_path):
    print('Loading embeddings from "{}"'.format(json_path))
    with open(json_path, 'r') as f:
        values = json.load(f)
    return (values['chapters'], np.array(values['embeddings']))

def read_epub(book_path, json_path, preview_mode, first_chapter, last_chapter):
    chapters = get_chapters(book_path, preview_mode, first_chapter, last_chapter)
    if preview_mode:
        return (chapters, None)
    print('Generating embeddings for chapters {}-{} in "{}"\n'.format(first_chapter, last_chapter, book_path))
    paras = [para for chapter in chapters for para in chapter['paras']]
    embeddings = get_embeddings(paras)
    try:
        with open(json_path, 'w') as f:
            json.dump({'chapters': chapters, 'embeddings': embeddings.tolist()}, f)
    except Exception as e:
        # Saving is best-effort: report the failure but still return the embeddings.
        print('Failed to save embeddings to "{}": {}'.format(json_path, e))
    return (chapters, embeddings)
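The save-and-reload cycle relies on converting the NumPy array to nested lists for JSON and back again. A small self-contained sketch of that round trip follows; the chapter data, the vectors, and the temporary file name are made up for illustration.

```python
import json
import os
import tempfile

import numpy as np

# Toy stand-ins for the real chapters and their embeddings.
chapters = [{'title': 'One', 'paras': ['first passage', 'second passage']}]
embeddings = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

# NumPy arrays are not JSON-serializable, so store them as nested lists.
path = os.path.join(tempfile.mkdtemp(), 'embeddings-demo.json')
with open(path, 'w') as f:
    json.dump({'chapters': chapters, 'embeddings': embeddings.tolist()}, f)

# Loading reverses the conversion: nested lists back into an array.
with open(path, 'r') as f:
    values = json.load(f)
restored = np.array(values['embeddings'])
print(restored.shape)
```

JSON keeps the file readable and portable between sessions, at the cost of a larger file than a binary format would produce.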

def process_file(path, preview_mode=False, first_chapter=0, last_chapter=math.inf):
    values = None
    if path.endswith('.json'):
        values = read_json(path)
    elif path.endswith('.epub'):
        json_path = 'embeddings-{}-{}-{}.json'.format(first_chapter, last_chapter, path)
        # Reuse embeddings saved by an earlier run with the same chapter range.
        if exists(json_path):
            values = read_json(json_path)
        else:
            values = read_epub(path, json_path, preview_mode, first_chapter, last_chapter)
    else:
        print('Invalid file format. Either upload an epub or a json of book embeddings.')
    return values

def print_and_write(text, f):
    print(text)
    f.write(text + '\n')

def index_to_para_chapter_index(index, chapters):
    for chapter in chapters:
        paras_len = len(chapter['paras'])
        if index < paras_len:
            return chapter['paras'][index], chapter['title'], index
        index -= paras_len
    return None
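The mapping from a flat passage index back to a chapter and a position within it can also be done with cumulative paragraph counts. The sketch below shows that variation; it is not the notebook's implementation, and `locate` and the toy chapters are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def locate(index, chapters):
    """Map a flat passage index to (chapter title, index within that chapter)."""
    # ends[c] is the number of passages in chapters 0..c combined.
    ends = list(accumulate(len(chapter['paras']) for chapter in chapters))
    if index < 0 or not ends or index >= ends[-1]:
        return None
    # The first chapter whose cumulative count exceeds the index contains it.
    chapter_no = bisect_right(ends, index)
    start = ends[chapter_no - 1] if chapter_no > 0 else 0
    return chapters[chapter_no]['title'], index - start

toy = [
    {'title': 'One', 'paras': ['a', 'b']},
    {'title': 'Two', 'paras': ['c', 'd', 'e']},
]
print(locate(0, toy), locate(2, toy), locate(4, toy), locate(5, toy))
```

For a single lookup the linear scan above is just as good; precomputing the cumulative counts only pays off when many indices are resolved against the same book.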

def search(query, embeddings, n=3):
    query_embedding = get_embeddings(query)[0]
    # Cosine similarity between the query and every passage embedding.
    scores = np.dot(embeddings, query_embedding) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding))
    results = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)[:n]

    # 'chapters' is read from the enclosing notebook scope; results are appended to result.text.
    with open('result.text', 'a') as f:
        header_msg = 'Results for query "{}" in "The Wizard and the Prophet.epub"'.format(query)
        print_and_write(header_msg, f)
        for index in results:
            para, title, para_no = index_to_para_chapter_index(index, chapters)
            result_msg = '\nChapter: "{}", Passage number: {}, Score: {:.2f}\n"{}"'.format(title, para_no, scores[index], para)
            print_and_write(result_msg, f)
        print_and_write('\n', f)
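The ranking step in 'search' is cosine similarity followed by a top-n selection. A minimal sketch on toy vectors is shown below, assuming no embedding is the zero vector; `top_n_cosine` is a hypothetical helper, and it uses `np.argsort` where the notebook uses `sorted`.

```python
import numpy as np

def top_n_cosine(query_vec, matrix, n=3):
    """Return the indices and scores of the n rows most similar to query_vec."""
    # Cosine similarity = dot product divided by the product of the vector norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    scores = matrix @ query_vec / norms
    # Negate the scores so argsort yields the highest similarity first.
    order = np.argsort(-scores)[:n]
    return order, scores[order]

# Three toy "passage embeddings" and one "query embedding".
passages = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
order, scores = top_n_cosine(query, passages, n=2)
print(order, np.round(scores, 2))
```

Because cosine similarity ignores vector length, a long passage and a short query can still score highly when they point in a similar direction in embedding space.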

Results for query "what areas of the world will be most harmed by climate change" in "The Wizard and the Prophet.epub"

Chapter: "[ SEVEN ] Air: Climate Change", Passage number: 31, Score: 0.66
"The most likely victims of climate change, in the short run, are people who live on oceanic islands, in very low-lying coastal settlements, in ice-bound Arctic communities, and around forests that burn after unwonted dry spells. Millions of people live in these places, but they are a small fraction of the world’s billions. The greatest potential harms of climate change will be experienced by future generations—centuries in the future, or even millennia. By our actions today (burning fossil fuels), the argument is, we are dumping problems (drought, sea-level rise) on tomorrow.
On the one hand, forcing other people to clean up our mess violates basic notions of fairness. On the other hand, actually preventing climate-change problems would require societies today to make investments, some of them costly, to benefit people in the faraway future. It’s like asking teenagers to save for their grandchildren’s retirement. Or, maybe, for somebody else’s grandchildren. Not many would do it."

Chapter: "[ SEVEN ] Air: Climate Change", Passage number: 64, Score: 0.58
"Now look at the numbers for climate change. Humans produce four main types of climate-altering gases: carbon dioxide, methane, nitrous oxide, and a bunch of fluorine-containing gases (these have names like hydrofluorocarbons, perfluorocarbons, and sulfur hexafluoride). Of these, carbon dioxide is the focus of most concern. The other three types actually absorb more infrared radiation, molecule for molecule, but they don’t stay in the air as long (the exception is some of the fluoride-containing gases, but they are not yet present in large quantities). Methane has as much as eighty times the effect on climate as an equivalent amount of carbon dioxide, but a typical methane molecule will only remain in the atmosphere for ten to twenty years. Carbon dioxide molecules, by contrast, will keep floating about for centuries, even millennia. They are a problem that doesn’t go away.
About 85 percent of the world’s carbon dioxide emissions come from fossil fuels, and about 80 percent of those come from just two sources: coal (46 percent) in its various forms, including anthracite and lignite; and petroleum (33 percent) in its various forms, including oil, gasoline, and propane. Coal and petroleum are used differently. Most petroleum is consumed by individuals and small businesses as they heat their homes and offices and drive their cars. By contrast, coal is mainly burned by heavy industry: coal produces the great majority of the world’s steel and cement and 40 percent of its electricity. The percentages vary from place to place, but the pattern remains. Coal provides about two-thirds of China’s energy, but almost all of it is used by big industries. Coal provides less than a fifth of U.S. energy, but again almost all of it is for industry. In both places petroleum consumption is on a smaller, more individual scale."

Chapter: "[ SEVEN ] Air: Climate Change", Passage number: 85, Score: 0.55
"Some economists argue that these figures are exaggerated; indeed, I have cited the researchers’ worst-case scenarios, to emphasize the stakes. But the same economists also point out that some of the most threatened areas are irreplaceable parts of the world’s cultural and natural patrimony. Venice is an obvious example, but so are places like central London, New Orleans and the Mississippi Delta, the vast ancient complex of Chan Chan in coastal Peru, and the great Sundarbans mangrove forests in India and Bangladesh.
To avoid this damage, cities would either have to shift their population to higher ground, construct networks of protective baffles, canals, dikes, and floodwalls, or both. All would be difficult and costly. Shanghai, with an average altitude of thirteen feet, is among the many Asian cities vulnerable to rising waters. Its 14.35 million inhabitants live on the low, flat delta of the Yangzi River. Because the city has withdrawn groundwater too rapidly, it has sunk more than nine feet in the last century. Meanwhile, sea levels are rising. In 1993 the city built a floodwall designed to block the surge from a once-in-a-thousand-years storm; within four years stormwater was lapping at its top. The nearest higher land is about thirty miles away, in the outskirts of the city of Hangzhou, population 2.45 million. Relocating part of Shanghai there would involve building a second or third Hangzhou atop the first."