Open Tech Challenges
1. Item Deduplication
We crawl deals and products across different affiliate networks and ecommerce sites. We need a way to dedup or consolidate items that are the same.
- Dedup - the same item appears on different sites
- Consolidate - items that are the same product but differ in size or color, and therefore in price
Challenges
We can build a vector space from a set of features extracted from each product (including a term vector from the text fields) and calculate the similarity of each pair using cosine similarity. However, given that we have over 1M products in the database, this approach requires on the order of 1M*1M comparisons, and the performance is very slow.
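One way to avoid comparing every pair is to fingerprint each product with SimHash (see the reference below) and only compare products that share a fingerprint band. The following is a minimal sketch, not a tuned implementation. It assumes the product title stands in for the full feature set, and the names `simhash`, `candidate_pairs` and `near_duplicates` are hypothetical helpers introduced for illustration.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

BITS = 64  # fingerprint width


def tokens(text):
    """Lowercase alphanumeric tokens; a real system would add more features."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text):
    """64-bit SimHash: each token votes +1/-1 on every bit position."""
    weights = [0] * BITS
    for tok in tokens(text):
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, w in enumerate(weights):
        if w > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def candidate_pairs(fingerprints, bands=4):
    """Bucket ids by each band of the fingerprint; only bucket-mates are paired."""
    width = BITS // bands
    mask = (1 << width) - 1
    buckets = defaultdict(set)
    for pid, fp in fingerprints.items():
        for b in range(bands):
            buckets[(b, (fp >> (b * width)) & mask)].add(pid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def near_duplicates(products, max_distance=3, bands=4):
    """products: dict of id -> title (assumption: title is the only feature)."""
    fps = {pid: simhash(title) for pid, title in products.items()}
    return sorted(
        (a, b)
        for a, b in candidate_pairs(fps, bands)
        if hamming(fps[a], fps[b]) <= max_distance
    )
```

With 4 bands, any two fingerprints that differ in at most 3 bits must agree on at least one whole band, so keeping `max_distance` below `bands` means no qualifying pair is missed by the bucketing step. The threshold itself is an assumption and would need tuning against real product data.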
Reference
- https://www.mitre.org/sites/default/files/pdf/08_0238.pdf
- SimHash Paper
- https://github.com/larsga/Duke (use Java with Lucene!)
2. Extract Product Info from Merchant Detail Page
When users play the product recognition game for our social images, they can simply paste the detail page URL and our system should extract the product info such as image, title, desc and price. This works the same way as pasting a URL into Facebook, which pulls the content info for sharing.
Challenges
There are 2 major challenges here:
- Dynamic page: if the page is not static or rendered by the server but uses ajax to pull content dynamically, you need to render the page as a browser would, running its javascript, to obtain the content before parsing. We can use phantomjs to simulate the browser behavior, but it runs one job per process. It may not be efficient if you have thousands of pages like this per crawl.
- Accuracy and Scale: if you need to build a generic product parser that can't hardcode the xpath and regex for each product field on a particular merchant page, how can you achieve accuracy in this automatic fashion?
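As a possible first pass at the generic parser, many merchant pages expose the same sharing metadata that Facebook reads, so those tags can be tried before any per-merchant rules. The sketch below uses only the Python standard library. The tag names in `FIELD_SOURCES` are an assumption about what merchants commonly publish, and `extract_product_info` is a hypothetical helper. It only sees markup present in the HTML it is given, so it does not address the dynamic page challenge.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta property=...> / <meta name=...> content values."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        # Keep the first occurrence of each key.
        if key and content and key not in self.meta:
            self.meta[key] = content


# Assumption: candidate tags per field, tried in order of preference.
FIELD_SOURCES = {
    "title": ["og:title", "twitter:title"],
    "desc": ["og:description", "description"],
    "image": ["og:image", "twitter:image"],
    "price": ["product:price:amount", "og:price:amount"],
}


def extract_product_info(html):
    """Return a dict with title, desc, image and price (None when not found)."""
    parser = MetaTagParser()
    parser.feed(html)
    return {
        field: next((parser.meta[k] for k in keys if k in parser.meta), None)
        for field, keys in FIELD_SOURCES.items()
    }
```

Fields that come back as `None` are the ones that would still need a learned or rule-based extractor, which gives a rough way to measure how far metadata alone gets on a given set of merchants.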
Reference
3. Identify similar images for search
We want to provide a photo search feature for our publishers. For example, if a user takes a photo of a new chair at a friend's house and uses it to search against Amazon, we should suggest similar products on Amazon (color, style, price etc). How can we do this fast given that we have millions of products?
We can dramatically improve the speed through locality-sensitive hashing (LSH), and tune the tradeoff between accuracy and speed through a forest-of-trees implementation like the one from Lyst. Have you ever done this before?
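To make the idea concrete, here is a minimal sketch of random-hyperplane LSH over image feature vectors, written in plain Python. It is not the Lyst implementation. It assumes each image has already been turned into a fixed-length feature vector by some earlier step that is not shown, and `HyperplaneLSH` with its parameters is a hypothetical class introduced for illustration.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HyperplaneLSH:
    """Several hash tables, each keyed by the signs of n_bits random projections."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        self.tables = [
            (
                [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)],
                defaultdict(list),
            )
            for _ in range(n_tables)
        ]
        self.vectors = {}

    @staticmethod
    def _key(planes, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes
        )

    def add(self, item_id, vec):
        self.vectors[item_id] = vec
        for planes, buckets in self.tables:
            buckets[self._key(planes, vec)].append(item_id)

    def query(self, vec, k=5):
        """Gather bucket-mates from every table, then rank them by cosine."""
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(self._key(planes, vec), []))
        ranked = sorted(candidates, key=lambda i: -cosine(vec, self.vectors[i]))
        return ranked[:k]
```

The two knobs map onto the accuracy and speed tradeoff mentioned above: more tables raise the chance that a true neighbor shares a bucket with the query, while more bits per table shrink the buckets and so the number of exact comparisons.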
4. Keyword Categorization
5. Domain/URL Categorization
First we need to derive a set of keywords from the URL, then use the keyword categorization tool to find the categories.
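The first step, deriving keywords from a URL, could look roughly like the sketch below. The stopword list and the minimum word length are assumptions, and `url_keywords` is a hypothetical helper. The categorization step is left to the keyword categorization tool from item 4.

```python
import re
from urllib.parse import unquote, urlparse

# Assumption: URL fragments that carry no topical meaning.
STOPWORDS = {
    "www", "com", "net", "org", "html", "htm", "php",
    "index", "http", "https", "the", "and", "for",
}


def url_keywords(url):
    """Split host and path into lowercase words, dropping stopwords and repeats."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc} {unquote(parsed.path)}".lower()
    seen = set()
    keywords = []
    for word in re.findall(r"[a-z]+", raw):
        if len(word) > 2 and word not in STOPWORDS and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


print(url_keywords("https://www.example.com/home-garden/outdoor-chairs.html"))
# prints ['example', 'home', 'garden', 'outdoor', 'chairs']
```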
6. Keyword Suggestion/Expansion
7. Hashtag Correlation
We have crawled a bunch of social content from Facebook, Twitter, etc. This content normally comes with hashtags. If a user gives us a hashtag, how can our system suggest similar hashtags from the social content we aggregate?
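One simple baseline is to treat hashtags that appear in the same post as related and rank them by how strongly they overlap. The sketch below scores pairs with Jaccard similarity over the posts each tag appears in. The choice of Jaccard is an assumption, not a settled approach, and `build_cooccurrence` and `similar_hashtags` are hypothetical helpers.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")


def build_cooccurrence(posts):
    """Count posts per hashtag and posts per hashtag pair."""
    tag_counts = Counter()
    pair_counts = defaultdict(Counter)
    for text in posts:
        # Deduplicate within a post so each post counts once per tag.
        tags = sorted({t.lower() for t in HASHTAG.findall(text)})
        tag_counts.update(tags)
        for a, b in combinations(tags, 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return tag_counts, pair_counts


def similar_hashtags(tag, tag_counts, pair_counts, k=5):
    """Top-k co-occurring hashtags ranked by Jaccard similarity, ties by name."""
    tag = tag.lower().lstrip("#")
    scored = []
    for other, both in pair_counts.get(tag, {}).items():
        union = tag_counts[tag] + tag_counts[other] - both
        scored.append((both / union, other))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [other for _, other in scored[:k]]


posts = [
    "New sofa day #home #decor #sofa",
    "Loving this lamp #home #decor",
    "Game night #nba #sports",
    "#decor ideas #diy",
]
counts, pairs = build_cooccurrence(posts)
print(similar_hashtags("#decor", counts, pairs))
# prints ['home', 'diy', 'sofa']
```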
8. Trend spotting
We want to spot content that is getting popular in each niche and recommend it to our users, as BuzzFeed has done. What is the best approach we should take?
9. Spot Tweets With Commerce Value
We want to spot tweets for which we can recommend products from Amazon, like "My TV is broken last night...", and our engine should auto-reply with a TV deal from Amazon or another affiliate network.
Challenges
- Parse the tweet and identify that the user has a need for a TV, as opposed to a tweet that simply says "I see Tom Cruise on the TV show in HBO last night".
- When a tweet with commercial value is spotted, we want to generate a human-friendly message as the reply. The reply messages cannot all be exactly the same, since we may send out quite a few per day and we need to fly under the radar of twitter.
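For the first challenge, a naive baseline is to require both a product term and a "need" cue in the same tweet. The sketch below is only a starting point for comparison. The term and cue lists are made-up assumptions, `commerce_intent` is a hypothetical helper, and a production system would more likely need a trained classifier.

```python
import re

# Assumption: a tiny illustrative vocabulary mapping words to product types.
PRODUCT_TERMS = {
    "tv": "TV",
    "television": "TV",
    "laptop": "laptop",
    "phone": "phone",
}

# Assumption: phrases hinting that the author needs a replacement or purchase.
NEED_CUES = [
    r"\bbroken\b",
    r"\bbroke\b",
    r"\bdied\b",
    r"\bstopped working\b",
    r"\bneed a new\b",
    r"\blooking for a\b",
]
NEED_RE = re.compile("|".join(NEED_CUES))


def commerce_intent(tweet):
    """Return the product type if the tweet shows a need cue, else None."""
    text = tweet.lower()
    if not NEED_RE.search(text):
        return None
    for word in re.findall(r"[a-z]+", text):
        if word in PRODUCT_TERMS:
            return PRODUCT_TERMS[word]
    return None


print(commerce_intent("My TV is broken last night..."))
# prints TV
print(commerce_intent("I see Tom Cruise on the TV show in HBO last night"))
# prints None
```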
10. Summarize the reviews like appcrawlr.com
- Sentiment analysis
- Feature grouping