Jun 22 2019
Obviously not every package on GitHub is going to be available via pip, but downloading and installing manually clutters up your project directory, which kind of defeats the purpose of using pipenv in the first place. However, you can install a package straight from its Git URI with pipenv, just like you can with pip. Here's what you type:
pipenv install -e git+git://github.com/user/project.git#egg=<project>
Pretty simple, right? Here's an example of one I've used recently, just in case:
pipenv install -e git+git://github.com/miso-belica/sumy.git#egg=sumy
Which is the command to install this package: https://github.com/miso-belica/sumy
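For reference, pipenv records this as an editable VCS entry in the Pipfile. From memory it looks roughly like the snippet below, though the exact keys may differ slightly depending on the pipenv version:
[packages]
sumy = {editable = true, git = "git://github.com/miso-belica/sumy.git"}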
If you get a "pipenv: command not found" error, this fixes it:
sudo -H pip install -U pipenv
For Scrapy with Python 3, you'll need:
sudo apt-get install python3 python-dev python3-dev \
build-essential libssl-dev libffi-dev \
libxml2-dev libxslt1-dev zlib1g-dev \
python-pip
With Python 2, you'll need:
sudo apt-get install python-dev \
build-essential libssl-dev libffi-dev \
libxml2-dev libxslt1-dev zlib1g-dev \
python-pip
Jun 14 2019
Import CSV as Dict
- Creates ordered dict
- You can raise the field size limit for huge files
- next() skips a row when you need it to (note that DictReader already consumes the header itself)
import csv

# DictReader creates an ordered dict per row (the first row becomes the keys)
with open('./data/file.csv', newline='') as file:
    # Huge csv files might give you a field size limit error
    csv.field_size_limit(100000000)
    results = csv.DictReader(file, delimiter=';', quotechar='*', quoting=csv.QUOTE_ALL)
    # next() skips a row -- DictReader already consumed the header, so this skips the first data row
    next(results)
    for row in results:
        # prints each item in the column with header 'key'
        print(row['key'])
Import CSV with No Header (nested lists)
- newline='' prevents blank lines
- csv.reader uses indexes [0], [1]
# newline='' prevents blank lines
with open('./data/file.csv', newline='') as file:
    results = csv.reader(file, delimiter=':', quoting=csv.QUOTE_NONE)
    for row in results:
        # csv.reader rows are plain lists, so use indexes
        print(row[0])
Writing and Creating Headers
- Create a csv.writer object
- Create header manually before loop
- Nested lists are better than tuples inside lists
- writer.writerow and writer.writerows
# Creates a csv.writer object (wrapped in open() here so the snippet stands on its own)
with open('./data/file.csv', 'w', newline='') as file:
    writer = csv.writer(
        file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    # Write the header first if you would like
    writer.writerow(['title', 'price', 'shipping'])

    ''' Tuples inside a list also work (lists inside lists are usually better, though).
    If you're using tuples and they're variable size, note that a single tuple
    will convert to string type in a loop, so indexing it with [0] won't work. '''
    products = [['slinky', '$5', 'Free'],
                ['pogo', '$12', '$6'],
                ['Yoyo', '$7', '$2']]

    # Write each row normally
    for item in products:
        writer.writerow(map(str, item))

    # Writes all items into a single row
    writer.writerow(sum(products, []))

    # Writes all 3 rows
    writer.writerows(products)
Using DictWriter for Headers
- fieldnames indicates header to object
- writer.writeheader() writes those fields
# DictWriter's fieldnames will add the headers for you when you call writeheader()
with open("./data/file.csv", "w", newline='') as file:
    writer = csv.DictWriter(
        file, fieldnames=['title', 'price', 'shipping'],
        quoting=csv.QUOTE_NONNUMERIC)
    writer.writeheader()
    # DictWriter expects dicts keyed by the fieldnames, not plain lists
    writer.writerows([{'title': 'slinky', 'price': '$5', 'shipping': 'Free'},
                      {'title': 'pogo', 'price': '$12', 'shipping': '$6'},
                      {'title': 'Yoyo', 'price': '$7', 'shipping': '$2'}])
Bonus - Flatten any List
- Function will flatten any level of nested lists
- or check isinstance(i, (list, tuple)) to catch tuples too
# -- Bonus (Off Topic) --
# You can flatten any list with type checking and recursion
l = [1, 2, [3, 4, [5, 6]], 7, 8, [9, [10]]]
output = []

def flatten_list(l):
    for i in l:
        if type(i) == list:
            flatten_list(i)
        else:
            output.append(i)

flatten_list(l)
print(output)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
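A variation I might switch to later (not what the snippet above uses): yield from a generator instead of appending to a module-level output list, and use isinstance so tuples are caught too, as mentioned in the notes above.
def flatten(items):
    # Recursively yield scalars from arbitrarily nested lists/tuples
    for i in items:
        if isinstance(i, (list, tuple)):
            yield from flatten(i)
        else:
            yield i

print(list(flatten([1, 2, [3, 4, [5, 6]], 7, 8, [9, [10]]])))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]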
Jun 13 2019
So I'm dealing with this huge database where each item has a bunch of levels and some items have different keys than others. So I made a script that takes a list of key paths and tries them in order; as soon as one works, the loop breaks. Pretty simple, and I'm sure I could make it cleaner, but it works well enough and I don't expect any 4-level items any time soon. Anyways, here's the code:
# items is a list of dicts to run the key paths on
all_results = []

for item in items:
    results = {}

    def key_attempter(name, key_lists):
        # item and results are read straight from the enclosing scope
        for key in key_lists:
            if len(key) == 1:
                try:
                    results[name] = item[key[0]]
                    break
                except (KeyError, TypeError):
                    pass
            elif len(key) == 2:
                try:
                    results[name] = item[key[0]][key[1]]
                    break
                except (KeyError, TypeError):
                    pass
            elif len(key) == 3:
                try:
                    results[name] = item[key[0]][key[1]][key[2]]
                    break
                except (KeyError, TypeError):
                    pass

    feat_lists = {
        'price': [
            ['hdpData', 'homeInfo', 'price'], ['price'],
            ['hdpData', 'priceForHDP'], ['priceLabel'],
            ['hdpData', 'homeInfo', 'zestimate'], ['hdpData', 'festimate']
        ],
        'bed': [
            ['beds'],
            ['hdpData', 'homeInfo', 'bedrooms']
        ],
        'bath': [
            ['baths'],
            ['hdpData', 'homeInfo', 'bathrooms']
        ]}

    for k in feat_lists.keys():
        key_attempter(k, feat_lists[k])
    all_results.append(results)
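If deeper nesting ever shows up, a generic lookup helper would avoid the if/elif ladder entirely. This is just a sketch of the same idea (not what the script above does); deep_get and first_match are hypothetical names, and functools.reduce walks a key path of any length:
import functools

def deep_get(data, path):
    # Walk a key path like ['hdpData', 'homeInfo', 'price']; return None if any step is missing
    try:
        return functools.reduce(lambda d, k: d[k], path, data)
    except (KeyError, IndexError, TypeError):
        return None

def first_match(item, key_lists):
    # Return the first value found among several candidate key paths
    for path in key_lists:
        value = deep_get(item, path)
        if value is not None:
            return value
    return None

# e.g. first_match(item, [['hdpData', 'homeInfo', 'price'], ['price']])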
Jun 11 2019
I've been coding again and just remembered how well this website works for keeping track of cool tricks I learn. Sometimes it's really hard to find simple, generic examples of things to help teach the fundamentals. I needed to write to a file without opening the text document 1000 times, and I finally found a really clean example that helped me understand the pieces.
Edit: ThreadPool is a lot easier, and it replaces the threading you'd otherwise do inside a loop:
from multiprocessing.pool import ThreadPool as Pool
threads = 100
p = Pool(threads)
p.map(function, list)
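For a self-contained picture of what that does (the function and list names above are placeholders), map blocks until every call finishes and keeps results in input order:
import time
from multiprocessing.pool import ThreadPool as Pool

def work(n):
    time.sleep(0.1)  # stand-in for I/O (a request, a file write, etc.)
    return n * n

p = Pool(10)                       # 10 worker threads
results = p.map(work, range(20))   # blocks until all 20 calls finish, order preserved
p.close()
p.join()
print(results)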
More complicated version:
import threading

lock = threading.Lock()

def thread_test(num):
    phrase = "I am number " + str(num)
    with lock:
        print(phrase)
        f.write(phrase + "\n")

threads = []
f = open("text.txt", 'w')

for i in range(100):
    t = threading.Thread(target=thread_test, args=(i,))
    threads.append(t)
    t.start()

# Wait for every thread to finish before closing the file
for t in threads:
    t.join()
f.close()
Close something on Scrapy spider close without using a pipeline:
from scrapy import signals
from scrapy.spiders import CrawlSpider
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # second param is the instance of the spider about to be closed
        # close files, DB connections, etc. here
        pass
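Newer Scrapy versions deprecate scrapy.xlib.pydispatch, so if I ever update this, the documented route is connecting through the crawler's signal manager in from_crawler. A sketch with the same spider_closed method as above:
from scrapy import signals
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'my_spider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # close files, DB connections, etc. here
        pass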
Instead of using an "if time" or "if count" check to activate something, I found a decorator that makes sure the function only runs once:
def run_once(f):
    def wrapper(*args, **kwargs):
        if not wrapper.has_run:
            wrapper.has_run = True
            return f(*args, **kwargs)
    wrapper.has_run = False
    return wrapper

@run_once
def my_function(foo, bar):
    return foo + bar
You can also resize the terminal inside the code:
import sys
sys.stdout.write("\x1b[8;{rows};{cols}t".format(rows=46, cols=54))
I got stuck for a while trying to get my repository to let me log in without creating an SSH key (super annoying imo), and I figured out that I had added the SSH URL as the origin URL and needed to reset it to the HTTPS one:
change origin url
git remote set-url origin <url-with-your-username>
Combine mp3 files with linux:
ls *.mp3
sudo apt-get install mp3wrap
mp3wrap output.mp3 *.mp3
Regex is always better than splitting a bunch of times and making the code messy. Plus it's a lot easier to pick the code up later on and figure out what's going on. So I decided to take my regex to the next level and start labeling groups (I'm even going to give it its very own tag :3):
pat = r'(?<=\,\"searchResults\"\:\{)(?P<list_results>.*)(?=\,\"resultsHash\"\:)'
# re.search, not re.match -- a lookbehind can never succeed at position 0
m = re.search(pat, url)
if m:
    self.domain = m.group('list_results')
Feb 11 2019
I wanted to come up with a way to scrape the domains that have had the most time to cool off first, so I just keep a time.time() stamp per domain (set in the meta or right after the request) and grab the smallest number (the oldest) when picking the next link.
import time

import scrapy
from scrapy import Spider

# PageItem and PageItemLoader are defined elsewhere in the project


class PageSpider(Spider):
    name = 'page_spider'
    # note: Scrapy's real attribute is allowed_domains; this one is just informational
    allowed_urls = ['https://www.amazon.com', 'https://www.ebay.com', 'https://www.etsy.com']

    custom_settings = {
        'ITEM_PIPELINES': {
            'pipelines.MainPipeline': 90,
        },
        'CONCURRENT_REQUESTS': 200,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 200,
        'ROBOTSTXT_OBEY': False,
        'CONCURRENT_ITEMS': 800,
        'REACTOR_THREADPOOL_MAXSIZE': 1600,
        # Hides printing item dicts
        'LOG_LEVEL': 'INFO',
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1,
        # Stops loading page after 5mb
        'DOWNLOAD_MAXSIZE': 5592405,
        # Grabs xpath before site finishes loading
        'DOWNLOAD_FAIL_ON_DATALOSS': False
    }

    def __init__(self):
        super().__init__()
        self.links = ['https://www.test.com', 'https://www.different.org', 'https://www.pogostickaddict.net']
        self.domain_count = {}

    def start_requests(self):
        while self.links:
            start_time = time.time()
            if self.domain_count:
                # Pick a link whose domain has been idle the longest (oldest timestamp)
                coolest = min(self.domain_count, key=self.domain_count.get)
                url = next((x for x in self.links if coolest in x), self.links[0])
            else:
                url = self.links[0]
            request = scrapy.Request(url, callback=self.parse, dont_filter=True,
                                     meta={'time': time.time()})
            request.meta['start_time'] = start_time
            request.meta['url'] = url
            yield request

    def parse(self, response):
        domain = response.url.split('//')[-1].split('/')[0]
        self.domain_count[domain] = time.time()

        pageloader = PageItemLoader(PageItem(), response=response)
        pageloader.add_xpath('search_results', '//div[1]/text()')

        self.links.remove(response.meta['url'])
        yield pageloader.load_item()
Feb 03 2019
I got tired of trying to write to files because of the write limitations and switched everything over to postgres. Now I remember how well it works, but also how many problems can arise. The cool thing about my recent script is that it solves a lot of issues all in one go. Since I will be completing rows in the DB in multiple increments, I had to check whether a row exists and then decide which part to update. In other words, I had to SELECT, UPDATE, and INSERT in a few different ways. Here's the code:
import re
import datetime
import json

import psycopg2

with open('./data/database.json') as f:
    DATABASE = json.load(f)


class DBTest:

    def __init__(self, keyword, results):
        self.con = psycopg2.connect(**DATABASE)
        self.cur = self.con.cursor()
        self.mkeyword = keyword
        self.results = results
        self.pg_2 = 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3AXbox+One+Controller+Stand&page=2&keywords=Xbox+One+Controller+Stand'

    def updater(self, domain):
        # Parameterized queries avoid quoting bugs and SQL injection
        if domain == 'www.ebay.com':
            self.cur.execute("UPDATE keyword_pages SET ebay_results=%s WHERE keyword=%s",
                             (self.results, self.mkeyword))
        elif domain == 'www.etsy.com':
            self.cur.execute("UPDATE keyword_pages SET etsy_results=%s WHERE keyword=%s",
                             (self.results, self.mkeyword))
        elif domain == 'www.amazon.com':
            self.cur.execute("UPDATE keyword_pages SET amazon_results=%s, amazon_pg2=%s WHERE keyword=%s",
                             (self.results, self.pg_2, self.mkeyword))
        self.con.commit()

    def test(self):
        self.cur.execute("""SELECT * FROM keyword_pages WHERE NOT complete AND amazon_results
            != 'blank' AND ebay_results != 'blank' AND etsy_results != 'blank'""")
        rows = self.cur.fetchall()
        for row in rows:
            print(row[0])

        self.cur.execute("SELECT EXISTS(SELECT keyword FROM keyword_pages WHERE keyword=%s)",
                         (self.mkeyword,))
        exists = self.cur.fetchone()[0]
        if exists:
            self.updater('www.etsy.com')
        else:
            self.cur.execute("""INSERT INTO keyword_pages
                (keyword, amazon_results, amazon_pg2, ebay_results, etsy_results, complete)
                VALUES (%s, %s, %s, %s, %s, %s)""",
                ('pogo stick', 'blank', 'blank', '14', 'blank', 'f'))
            self.con.commit()
        self.con.close()


class LinkGen:

    def __init__(self):
        self.link_pieces = []
        self.links = []
        self.keywords = {
            'extra black coffee': [['www.amazon.com', '4', '/jumprope/s?ie=UTF8&page=2&rh=i%3Aaps%2Ck%3Ajumprope'], ['www.ebay.com', '5'], ['www.etsy.com', '7']],
            'decaf coffee': [['www.amazon.com', '5', 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Ablack+coffee&page=2&keywords=black+coffee&ie=UTF8&qid=1549211788'],
                             ['www.ebay.com', '3'], ['www.etsy.com', '9']],
        }

    # How Amazon identifies if a link is internal/new or external/old (just a Unix timestamp, very simple actually)
    def qid(self):
        return round((datetime.datetime.today() - datetime.datetime(1970, 1, 1)).total_seconds())

    def amazon_gen(self, search_term, page_total, page_2):
        self.link_pieces = ['https://www.amazon.com/s/ref=sr_pg_', '?rh=', '&page=', '&keywords=', '&ie=UTF8&qid=']
        rh = re.search('rh=([^&|$]*)', str(page_2), re.IGNORECASE).group(1)
        print(rh)
        all_links = []
        for page in range(1, int(page_total) + 1):
            # append a fresh qid timestamp at the end of each link
            all_links.append(
                f'{self.link_pieces[0]}{page}{self.link_pieces[1]}{rh}{self.link_pieces[2]}{page}{self.link_pieces[3]}{"+".join(search_term.split(" "))}{self.link_pieces[4]}{self.qid()}')
        return all_links

    def link_gen(self, domain, search_term, page_total):
        if domain == 'www.ebay.com':
            self.link_pieces = ['https://www.ebay.com/sch/i.html?_nkw=', '&rt=nc&LH_BIN=1&_pgn=']
        elif domain == 'www.etsy.com':
            self.link_pieces = ['https://www.etsy.com/search?q=', '&page=']
        all_links = []
        for page in range(1, int(page_total) + 1):
            all_links.append(f'{self.link_pieces[0]}{"+".join(search_term.split(" "))}{self.link_pieces[1]}{page}')
        return all_links

    def test(self):
        for keyword in self.keywords.keys():
            for results in self.keywords[keyword]:
                if results[0] == 'www.amazon.com':
                    self.links.append(self.amazon_gen(keyword, results[1], results[2]))
                else:
                    self.links.append(self.link_gen(results[0], keyword, results[1]))
        print(self.links)


if __name__ == "__main__":
    links = LinkGen()
    # links.test()  # uncomment to generate and print the links
    db = DBTest('pogo stick', '15')
    db.test()
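Side note: since keyword is the primary key, Postgres (9.5+) can collapse the whole exists-then-update-or-insert dance into a single upsert. This is just a sketch of how that would look against the same table with psycopg2, not what the class above does; upsert_etsy is a hypothetical helper name:
def upsert_etsy(cur, con, keyword, etsy_results):
    # One round trip: insert the row, or update etsy_results if the keyword already exists
    cur.execute("""
        INSERT INTO keyword_pages (keyword, amazon_results, amazon_pg2, ebay_results, etsy_results, complete)
        VALUES (%s, 'blank', 'blank', 'blank', %s, 'f')
        ON CONFLICT (keyword) DO UPDATE SET etsy_results = EXCLUDED.etsy_results
    """, (keyword, etsy_results))
    con.commit()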
Since I had to dig through many other projects' code bases to figure a lot of this out, not to mention Google, I figured I should put what I collected here so I can find it later.
-- psql
****
sudo apt install postgresql
sudo service postgresql start
sudo su - postgres
createuser --superuser ryan
psql # <- command line tool for making queries
\password ryan
\q # <- exit psql to create new users/dbs or import/export db's (psql is for sql)
createdb ryan  # or whatever
# exit, and now you can run psql in your own console with your username
***
# start automatically
sudo systemctl enable postgresql
# do database commands
psql -d <database>
ALTER USER ryan WITH ENCRYPTED PASSWORD '<password>';
sudo -i -u ryan
# export
pg_dump -U ryan ebay_keywords > database-dec-18.txt --data-only
# importable export
pg_dump -U ryan ebay_keywords > database-dec-18.pgsql
# Import
psql reviewmill_scraped < database-dec-18.pgsql
CREATE TABLE keyword_pages (
keyword VARCHAR(255) NOT NULL PRIMARY KEY,
amazon_results VARCHAR(16),
amazon_pg2 VARCHAR(255),
ebay_results VARCHAR(16),
etsy_results VARCHAR(16),
complete BOOLEAN NOT NULL
);
ALTER TABLE keyword_pages ALTER COLUMN etsy_results TYPE VARCHAR(16);
INSERT INTO keyword_pages (keyword, amazon_results, amazon_pg2, ebay_results, etsy_results, complete)
VALUES ('extra strong coffee', '12', 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Ablack+coffee&page=2&keywords=black+coffee&ie=UTF8&qid=1549211788', '12', '4', 'f');
CREATE TABLE reviews (
review_id VARCHAR(30) PRIMARY KEY,
asin VARCHAR(20) NOT NULL
);
ALTER TABLE reviews
ADD CONSTRAINT asin FOREIGN KEY (asin)
REFERENCES products (asin) MATCH FULL;
# Extra stuff
ALTER TABLE reviews ALTER COLUMN asin TYPE varchar(30);
ALTER TABLE reviews ADD COLUMN review_helpful INTEGER;
Windows has been making this increasingly difficult, but I think I've avoided the worst of my connection issues. First Windows decided to automatically detect my proxies, then I found out that my Ethernet card driver had some power-saving setting turned on, and I've been having random permission issues between WSL, pipenv, and postgres.
ipconfig /release
ipconfig /renew
I haven't tried this yet, but if my internet starts acting up again it's going to be the first thing I try; considering that restarting Windows seems to fix it, I think this should as well.
Jan 31 2019
You can catch 404s and connection errors by using errback= inside the scrapy.Request object. From there I just add the failed proxy (stored in the request meta) to a list of failed proxies inside the ProxyEngine class. If a proxy shows up in the failed list N times, it can be removed with the ProxyEngine.remove_bad() class function. I also discovered that passing download_timeout inside the request meta works a lot better than setting it in the spider's global settings. Now the spider doesn't hang on slow or broken proxies and is much, much faster.
Next I plan to refactor the ProxyEngine data to track attempts per domain so that I can catch proxies that have been banned by one domain but not others. Also, I need to feed bad_proxies back into the request generator after they've been down for N amount of time, and save all of the proxy data to a database. Here's the code:
Proxy Engine
class ProxyEngine:

    def __init__(self, limit=3):
        self.proxy_list = []
        self.bad_proxies = []
        self.good_proxies = []
        self.failed_proxies = []
        self.limit = limit

    def get_new(self, file='./proxies.txt'):
        new_proxies = []
        with open(file, 'r') as file:
            for line in file:
                new_proxies.append(f'https://{line.strip()}')
        return [self.proxy_list.append(x) for x in new_proxies
                if x not in self.proxy_list and x not in self.bad_proxies]

    def remove_bad(self):
        for proxy in self.proxy_list:
            if self.failed_proxies.count(proxy) >= self.limit:
                self.bad_proxies.append(proxy)
        # Iterate over a copy so removing items doesn't skip elements
        return [self.proxy_list.remove(x) for x in list(self.proxy_list) if x in self.bad_proxies]
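A rough sketch of that planned per-domain refactor (none of this exists in the class above yet, and ProxyStats is a made-up name): keep failure counts per proxy per domain instead of one flat list, so a ban on one site doesn't retire the proxy everywhere.
from collections import defaultdict

class ProxyStats:
    # Hypothetical helper: failures[proxy][domain] -> count
    def __init__(self, limit=3):
        self.failures = defaultdict(lambda: defaultdict(int))
        self.limit = limit

    def record_failure(self, proxy, domain):
        self.failures[proxy][domain] += 1

    def is_banned(self, proxy, domain):
        # Banned for this domain only; other domains keep using the proxy
        return self.failures[proxy][domain] >= self.limit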
Proxy Spider
# Imports used by the spider and its errback checks
import scrapy
from scrapy import Spider
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError


class ProxyTest(Spider):
    name = 'proxy_test'

    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ProxyPipeline': 400
        },
        'CONCURRENT_REQUESTS_PER_IP': 2,
    }

    def __init__(self):
        super().__init__()
        self.prox = ProxyEngine(limit=20)

    def start_requests(self):
        self.prox.get_new()
        for proxy in self.prox.proxy_list:
            request = scrapy.Request("https://dashwood.net/post/python-3-new-string-formatting/456ft",
                                     callback=self.get_title, errback=self.get_error, dont_filter=True)
            request.meta['proxy'] = proxy
            request.meta['dont_retry'] = True
            request.meta['download_timeout'] = 5
            yield request

    def get_title(self, response):
        print(response.status)
        print('*' * 15)

    def get_error(self, failure):
        if failure.check(HttpError):
            response = failure.value.response
            print("HttpError occurred", response.status)
            print('*' * 15)
        elif failure.check(DNSLookupError):
            request = failure.request
            print("DNSLookupError occurred on", request.url)
            print('*' * 15)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.prox.failed_proxies.append(request.meta["proxy"])
            print("TimeoutError occurred", request.meta)
            print('*' * 15)
        else:
            request = failure.request
            print("Other Error", request.meta)
            print(f'Proxy: {request.meta["proxy"]}')
            self.prox.failed_proxies.append(request.meta["proxy"])
            print('Failed:', self.prox.failed_proxies)
            print('*' * 15)
Jan 17 2019
Python 3 New String Formatting
I've been using the .format() method and explicitly naming variables (in case I change the order down the line) in all of my code because it just seemed like the cleanest way:
print('this is a string {variable}'.format(variable=variable))
But there is a much simpler/cleaner method: f-strings. They are shorter, easier to read, and just plain more efficient. I'll be refactoring any code I come across that uses the older formatting styles, just out of principle. Using this format with the scraping project I worked on this week was a significant convenience and I looooove it. Here's how it works (so simple):
test = 'this'
number = 254
print(f"{test} is a test for number {number}.")
this is a test for number 254.
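They also take arbitrary expressions and the same format specs as .format(), which covers most of what I ever need. A few quick examples (the variable names are just placeholders):
price = 1234.5678
name = 'pogo stick'
items = ['a', 'b', 'c']

print(f"{price:.2f}")          # 1234.57  (format spec)
print(f"{name.title()}")       # Pogo Stick  (any expression works)
print(f"{len(items)} items")   # 3 items
print(f"{name!r}")             # 'pogo stick'  (repr conversion)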
Jan 07 2019
I've been meaning to do this for a while now, but honestly it's been really difficult to find reference material to copy off of. Fortunately, today I found some really good repositories with almost exactly what I was looking for. Then, after I got it working, I combed the Scrapy docs very slowly, made sure that I understood all of the item loader functions, and added simple examples / documentation on most of the features.
One Stand-alone Scrapy Script to Rule Them All
Basically what I wanted was a minimal clean Scrapy script that I could use in other projects without being tied down to the scrapy-cli project crap. I actually feel like I have full control of my script and have been taking great care to organize it correctly. Also, using item loaders / processors is really cool and should open the door to solve issues really cleanly.
Note I added a few interesting features to showcase some of the functionality of item loaders.
#! /usr/local/bin/python3
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join
from scrapy import Spider, Item, Field
from scrapy.settings import Settings
# Originally built off of:
# https://gist.github.com/alecxe/fc1527d6d9492b59c610
def extract_tag(self, values):
    # Custom function used as an Item Loader input processor
    for value in values:
        yield value[5:-1]


class DefaultAwareItem(Item):
    # Converts field 'default' meta into a default value fallback
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Use python's built-in setdefault() function on all items
        for field_name, field_metadata in self.fields.items():
            if not field_metadata.get('default'):
                self.setdefault(field_name, 'No default set')
            else:
                self.setdefault(field_name, field_metadata.get('default'))


# Item Fields
class CustomItem(DefaultAwareItem):
    '''
    Input / Output processors can also be declared in the field meta, e.g. —

    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    '''
    title = Field(default="No Title")
    link = Field(default="No Links")
    desc = Field()
    tag = Field(default="No Tags")


class CustomItemLoader(ItemLoader):
    '''
    Item Loader declaration — input and output processors, functions
    https://doc.scrapy.org/en/latest/topics/loaders.html#module-scrapy.loader.processors

    Processors (any functions applied to items here):
    Identity() - leaves values as-is
    TakeFirst() - takes the first non-null value
    Join() - basically equivalent to u' '.join
    Compose() - applies a list of functions one at a time **accepts loader_context
    MapCompose() - applies a list of functions to a list of objects **accepts loader_context;
        the first function is applied to all objects, then the altered objects go to the next function, etc.

    https://doc.scrapy.org/en/latest/topics/loaders.html#declaring-input-and-output-processors
    _in processors are applied to extractions as soon as they are received
    _out processors are applied to collected data once loader.load_item() is yielded
    single items are always converted to iterables
    custom processor functions must receive self and values
    '''
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()

    tag_in = extract_tag  # function assigned as a class variable
    tag_out = Join(', ')


# Define a pipeline
class WriterPipeline(object):

    def __init__(self):
        self.file = open('items.txt', 'w')

    def process_item(self, item, spider):
        self.file.write(item['title'] + '\n')
        self.file.write(item['link'] + '\n')
        self.file.write(item['desc'] + '\n')
        self.file.write(item['tag'] + '\n\n')
        return item


# Define a spider
class CustomSpider(Spider):
    name = 'single_spider'
    allowed_domains = ['dashwood.net']
    start_urls = ['https://dashwood.net/']

    def parse(self, response):
        for sel in response.xpath('//article'):
            loader = CustomItemLoader(
                CustomItem(), selector=sel, response=response)
            loader.add_xpath('title', './/h2/a/text()')
            loader.add_xpath('link', './/a/@href')
            loader.add_xpath('desc', './/p/text()')
            loader.add_xpath('tag', './/a[@class="tag"]//@href')
            yield loader.load_item()


# Declare some settings / pipelines
settings = Settings({
    # pipeline paths start with the project/module name, so replace it with __main__
    'ITEM_PIPELINES': {
        '__main__.WriterPipeline': 100,
    },
    'DEFAULT_REQUEST_HEADERS': {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, sdch',
        'accept-language': 'en-US,en;q=0.8',
        'upgrade-insecure-requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'
    },
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    }
})
process = CrawlerProcess(settings)
# you can run 30 of these at once if you want, e.g —
# process.crawl(CustomSpider)
# process.crawl(CustomSpider) etc.. * 30
process.crawl(CustomSpider)
process.start()
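To see what those processors actually do on plain values, you can poke at them outside of a spider with the same imports as above; a quick standalone check I find handy:
from scrapy.loader.processors import MapCompose, Join, TakeFirst

print(MapCompose(str.strip, str.title)(['  hello ', ' world  ']))  # ['Hello', 'World']
print(Join(', ')(['a', 'b']))                                      # 'a, b'
print(TakeFirst()([None, '', 'first', 'second']))                  # 'first'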
Dec 23 2018
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)

print(urllib.parse.quote_plus("+kite+/"))
%2Bkite%2B%2F
Note that it's different in Python 2: urllib.quote_plus(), urllib.urlencode(), and urllib.urlopen().
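Related helpers from the same module that I keep reaching for: quote() (which leaves '/' alone by default) and parse_qs() to go back the other way:
from urllib.parse import quote, quote_plus, parse_qs

print(quote("+kite+/"))        # %2Bkite%2B/   ('/' is in the default safe set)
print(quote_plus("+kite+/"))   # %2Bkite%2B%2F (also turns spaces into '+')
print(parse_qs("spam=1&eggs=2&eggs=3"))  # {'spam': ['1'], 'eggs': ['2', '3']}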
Nov 09 2018
I finally got around to working on my Amazon project again.
Misc Notes
# Change postgres data directory
File path:
/etc/postgresql/10/main/postgresql.conf
File System Headache
I decided to clean up my hard drives, but I forgot how much of a headache it is trying to get an NTFS drive to work with transmission-daemon. Whatever, I'll just save to my ext4 partition for now and fix it later.
Update:
I bricked my OS install and went down a 3-hour nightmare trying to fix it. I eventually discovered that the culprit was a label from my old partition's mount point in the fstab file. Solution:
sudo nano /etc/fstab
# comment out old label
ctrl + o to save
ctrl + x to exit
reboot
My computer still doesn't restart properly because I broke something in the boot order trying to fix it. Not a big deal I just enter my username/password in the terminal then type startx.
LexSum Progress
Had to slice to 50 reviews for each rating to save time, but I can probably make it longer for launch. At first I was thinking there would be 60 million entities to process, but actually it's more like 900k x 5 (one per rating), and as long as I don't lexsum 1000+ reviews per rating it should finish in a few days. I reallllly need to add a timer function asap. I can just time 1000 or so products, multiply that by 900k (or whatever the total number of products in my database is), and I'll have a pretty good idea how long it will take.
if len(titles) > 50:
    titlejoin = ' '.join(lex_sum(' '.join(titles[:50]), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments[:50]), sum_count))
else:
    titlejoin = ' '.join(lex_sum(' '.join(titles), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments), sum_count))
I'm thinking I can clean these lines up now that I'm staring at it. Maybe something like:
titlejoin = ' '.join(
    lex_sum(' '.join(titles[:min(len(titles), 50)]), sum_count))
textjoin = ' '.join(
    lex_sum(' '.join(comments[:min(len(comments), 50)]), sum_count))
My estimated-time-remaining function appends the time elapsed every ten iterations to a list, averages the last 500 (or fewer) entries of that list, then multiplies that average by the remaining iterations and displays it in a human-readable format:
# Before the loop
avg_sec = 0
times = []
start = time.time()

# Inside the loop (count is the current iteration, limit is the total)
# Display time remaining
if avg_sec:
    seconds_left = ((limit - count) / 10) * avg_sec
    m, s = divmod(seconds_left, 60)
    h, m = divmod(m, 60)
    print('Estimated Time Left: {}h {}m {}s'.format(
        round(h), round(m), round(s)))

if not count % 10:
    end = time.time()
    time_block = end - start
    start = end
    times.append(time_block)
    avg_sec = functools.reduce(
        lambda x, y: x + y, times[-min(len(times), 500):]) / len(times[-min(len(times), 500):])
    print('Average time per 10:', round(avg_sec, 2), 'seconds')
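If I ever revisit this, a collections.deque with maxlen would handle the "last 500 samples" window for me. A rough sketch, not what the script currently uses; count and limit are the same loop variables as above:
import time
from collections import deque

times = deque(maxlen=500)  # old samples fall off automatically
start = time.time()

# inside the loop:
if not count % 10:
    now = time.time()
    times.append(now - start)
    start = now
    avg_sec = sum(times) / len(times)
    seconds_left = ((limit - count) / 10) * avg_sec
    h, rem = divmod(seconds_left, 3600)
    m, s = divmod(rem, 60)
    print(f'Estimated Time Left: {round(h)}h {round(m)}m {round(s)}s')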
Another thought I had is that this save_df module I coded (it's at like 400 lines of code already x_x) is actually a crucial part of my ultimate code base. I'm pretty happy that I spent so much time writing it into proper functions.
Nov 01 2018
So I ran my summarizer yesterday and it took literally all day to run only 200 products through the lex sum function. So I went through my code and added a timer for each major step in the process, like so:
start = time.time()
asin_list = get_asins(limit)
end = time.time()
print('Get ASINs: ', end - start)
Turns out it was taking over 60 seconds per query. I did the math, and at the rate it was going it would take almost two years to complete every product in my database. So I started looking around at different ways to handle large databases. Turns out databases are a lot more complicated than I believed. It felt like looking for a PHP solution back in high school when I didn't know enough to know what to look for. Finally I stumbled upon a feature called indexing. First I added the indexing code inside of my script, which had no effect, even though it seemed like it had worked properly. Still, I was not going to give up that easily, so I opened up postgres directly in the terminal and poked around to see if the index had been applied properly. Turns out it was not applied at all. Here is the code I used to index the asin column in reviews:
# Remote connect
psql -U ryan -h 162.196.142.159 -p 5432 databasename

# Display a table's indexes
SELECT * FROM pg_indexes WHERE tablename = 'reviews';

# Create the index
CREATE INDEX asin_index ON reviews (asin);
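To confirm the index is actually being used (instead of guessing from run times), EXPLAIN prints the query plan: an index scan on asin_index means it's working, a sequential scan means it isn't. The asin value below is just a placeholder:
# Show the query plan (run inside psql); 'B000000000' is a placeholder asin
EXPLAIN ANALYZE SELECT * FROM reviews WHERE asin = 'B000000000';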
Eureka! It worked; the script that took all day to run yesterday now runs in about a minute flat! That is the biggest difference in performance I've ever experienced, and I can't wait to see where else indexing will help my databases.
Other than that, Erin showed me a bunch of stuff in Illustrator and Photoshop.
- ctrl+click with select tool enables auto-select
- ctrl+d — deselect
- ctrl+shift+i — invert selection
- ctrl+j — duplicate layer
- ctrl+alt+j — duplicate and name layer