
Basic Flow

Abstract Surface

1) Gathering of URLs

Uses web scraping techniques to gather image URLs from various websites in a format the code can use later.

2) Storing of Images

The collected URLs are then used to asynchronously download and label the images efficiently.

3) Data Cleanup

The framework comes with a toolkit that makes it easy to clean up the data for use.

4) Deployment/Data Testing

A basic CNN model is ready to be trained on the dataset to get a rough sense of its "learnability".
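
A minimal sketch of such a baseline CNN (the framework choice, input shape, and class count here are assumptions for illustration, not the project's actual model):

    import tensorflow as tf

    # Small baseline CNN: the goal is only a rough "learnability" check,
    # i.e. does validation accuracy climb above chance on the scraped data.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(128, 128, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),  # assumed two-class dataset
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])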

Under-the-Hood Blocks of Code:

Scrapers

Example code from a few of the functioning scrapers in the script. Each scraper block takes the dataset search term, navigates to the page displaying the image results, and pulls the individual image IDs. Each method then saves the formatted URLs to the same shared set.
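
A minimal sketch of one such scraper block, assuming a requests/BeautifulSoup stack; the site URL, CSS selector, and function name are hypothetical stand-ins for the real target sites:

    import requests
    from bs4 import BeautifulSoup

    image_urls = set()  # shared set that every scraper writes into

    def scrape_example_site(search_term):
        # Hypothetical search page; the real scrapers each target a specific site.
        resp = requests.get("https://example.com/search",
                            params={"q": search_term}, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        # Pull each image's id from the results grid and build a direct image URL.
        for tag in soup.select("img[data-id]"):
            image_id = tag["data-id"]
            image_urls.add(f"https://example.com/images/{image_id}.jpg")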

Run Function

Function that launches all of the separate scrapers. It uses threading to collect data from multiple websites at once and, at the end, returns the full list of image URLs.
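
A sketch of that run function under the same assumptions as above (one thread per scraper function, all writing into the shared image_urls set):

    import threading

    def run(search_term):
        # One thread per scraper so every site is queried at the same time.
        scrapers = [scrape_example_site]  # plus the other scraper functions
        threads = [threading.Thread(target=s, args=(search_term,))
                   for s in scrapers]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Every scraper added into the same shared set, so just hand it back.
        return list(image_urls)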

Downloading

Function that asynchronously iterates through the collected image URLs and downloads each image in turn. As the images are downloaded, each one is assigned a unique ID number.
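
A sketch of that download loop, assuming an aiohttp-based coroutine (the destination folder and file naming are illustrative); it mirrors the one-at-a-time iteration described above, with the loop index doubling as the unique ID:

    import asyncio
    import os
    import aiohttp

    async def download_all(urls, dest="data"):
        os.makedirs(dest, exist_ok=True)
        async with aiohttp.ClientSession() as session:
            for uid, url in enumerate(urls):
                try:
                    async with session.get(url) as resp:
                        data = await resp.read()
                    # Each image is saved under its unique id number.
                    with open(os.path.join(dest, f"{uid}.jpg"), "wb") as f:
                        f.write(data)
                except aiohttp.ClientError:
                    continue  # skip dead links; cleanup catches anything corrupted

    asyncio.run(download_all(run("cats")))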

Cleanup

Function that removes any corrupted images that may have slipped through, preventing errors later on. It runs automatically at the end of the download loop.
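
A minimal sketch of such a cleanup pass, assuming Pillow is used to verify each file (the folder name matches the hypothetical download example above):

    import os
    from PIL import Image

    def cleanup(dest="data"):
        # Try to verify every downloaded file; delete anything Pillow rejects.
        for name in os.listdir(dest):
            path = os.path.join(dest, name)
            try:
                with Image.open(path) as img:
                    img.verify()
            except Exception:
                os.remove(path)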
