AWS Book Finder
Book Finder is a simple OCR app that allows a user to upload images and search for text contained in them. Its original purpose was to search through a disorganized library of books to find specific keywords (hence the branding), but it obviously has more potential uses.
GITHUB ➤
I built this app to learn AWS serverless technology. Here's a quick list of the AWS services that I used:
- CloudFormation
- API Gateway
- Cognito
- Lambda
- AWS JS SDK V3
- AWS CLI
- CloudWatch
- S3
- S3 Lambda trigger
- DynamoDB
- Rekognition
- IAM
- SES
- Signed URLs
Additionally, this project was my first exposure to container-based development. My dad (who also contributed to the Rekognition portion of the project) helped me set up a dev container in VS Code. He also created a Makefile that made deployment and testing easy, which is one of the reasons I'm currently learning shell. Lambda functions are zipped and stored in a versioned S3 bucket so that successive builds can detect code changes.
Eventually, I'd like to learn CDK or SAM, but my dad and I thought we'd get more value out of the project by learning the underlying technology first. As such, the entire AWS stack is written in CloudFormation. Writing it this way was difficult at times, but being exposed to the full set of parameters that can be passed to each of these services really helped me to understand the overall structure and how the pieces fit together.
FRONT-END
As I mentioned earlier, the focus of this project was not the front-end. Despite that, I learned a lot in the process, including the canvas API, UI/UX refinements, interacting with the local filesystem and localStorage, FOUC (flash of unstyled content), and tracking user state across an app. It's all written in vanilla HTML/CSS/JS - no frameworks or libraries.
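Tracking user state, for instance, can be as simple as caching session data in localStorage and checking it on every page load. A minimal sketch (the key name and token shape are illustrative, not the app's actual code):

```js
// Sketch: persist session state across pages with localStorage.
const SESSION_KEY = 'bookfinder.session'; // illustrative key name

function saveSession(tokens) {
  localStorage.setItem(SESSION_KEY, JSON.stringify({
    idToken: tokens.idToken,
    expiresAt: Date.now() + tokens.expiresIn * 1000,
  }));
}

function loadSession() {
  const raw = localStorage.getItem(SESSION_KEY);
  if (!raw) return null;
  const session = JSON.parse(raw);
  // Treat an expired token as signed out.
  return session.expiresAt > Date.now() ? session : null;
}
```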
The most difficult portion of the front-end was the image upload page. A photo of a bookshelf might contain parts of other bookshelves, so I added the ability for a user to crop their images prior to upload to avoid tainting the search results with duplicates. This really forced me to learn how to use the canvas. Managing event listeners and intermediate states throughout the process was more difficult than I thought it would be. Designing a clean, intuitive UI for this process was also a significant challenge - hopefully I succeeded.
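The crop itself boils down to copying the selected region of the source image onto a fresh canvas and exporting it as a blob for upload. A simplified sketch (the selection rectangle and names are illustrative, and the hard part - the event handling around it - is omitted):

```js
// Sketch: crop the user's selection to a blob for upload.
// `image` is a loaded <img> or Image; `sel` is the selection rectangle
// in image pixel coordinates.
function cropToBlob(image, sel) {
  const canvas = document.createElement('canvas');
  canvas.width = sel.width;
  canvas.height = sel.height;
  const ctx = canvas.getContext('2d');
  // Copy only the selected source region onto the new canvas.
  ctx.drawImage(image, sel.x, sel.y, sel.width, sel.height,
                0, 0, sel.width, sel.height);
  return new Promise((resolve) => canvas.toBlob(resolve, 'image/jpeg', 0.9));
}
```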
The Library page was also fairly difficult to get right. I went through a few iterations of displaying search results. Initially I tried simple flashing outlines, but they were difficult to see on the colorful, busy images I was using, especially if the words were small. I contemplated using a mask, but decided that the surrounding context should be clearly visible alongside the highlighted result. I settled on a mixed strategy - results are highlighted by bright pulsing outlines, and the image fades in and out of grayscale to make the outlines easier to see without obscuring their context (sorry to the colorblind out there - this is on my list of things to improve).
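At its core, the effect is a CSS grayscale filter on the image oscillating in sync with outlines drawn on an overlay canvas. A simplified sketch (element names and polygon format are illustrative):

```js
// Sketch: pulse outlines on an overlay canvas while the underlying <img>
// fades in and out of grayscale. `matches` holds word polygons already
// scaled to canvas pixels.
function animateHighlights(img, overlay, matches) {
  const ctx = overlay.getContext('2d');
  function frame(t) {
    const pulse = (Math.sin(t / 250) + 1) / 2;      // oscillates 0..1
    img.style.filter = `grayscale(${pulse})`;       // fade color out and back
    ctx.clearRect(0, 0, overlay.width, overlay.height);
    ctx.lineWidth = 2 + 2 * pulse;
    ctx.strokeStyle = `rgba(255, 230, 0, ${0.5 + 0.5 * pulse})`;
    for (const polygon of matches) {
      ctx.beginPath();
      polygon.forEach((p, i) => (i ? ctx.lineTo(p.x, p.y) : ctx.moveTo(p.x, p.y)));
      ctx.closePath();
      ctx.stroke();
    }
    requestAnimationFrame(frame); // a real version would stop when the search is cleared
  }
  requestAnimationFrame(frame);
}
```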
SECURITY
This project was my first real taste of cybersecurity. Unlike previous projects, this has my credit card attached to it, and as far as Amazon's concerned, I'm liable for all execution time. I was fairly anxious about putting something like this out there for the first time. You never know when your app is going to end up in front of someone who wants to financially ruin you for the thrill of it. You also never know when you're going to accidentally write an infinite loop into your app and sign out for the night while AWS happily chugs away to the tune of thousands of dollars.
That's almost what happened to me and my dad, even after we had discussed the risk and shared cautionary tales. We had set up a bucket to trigger a Lambda when a new object was uploaded, but that Lambda was mistakenly writing its output back into the same bucket, re-triggering itself. Luckily, we caught the loop when we saw the massive amount of CloudWatch logs being generated and managed to run delete-stack before it got out of hand.
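The guard worth having from the start is to keep the Lambda's output out of the paths that trigger it - a prefix filter on the bucket notification, plus a defensive check in the handler itself. A sketch of the latter (the "output/" prefix and processImage helper are hypothetical):

```js
// Sketch: guard a Lambda against re-processing its own output.
// The bucket notification filter should also exclude the prefix;
// this runtime check is a backstop in case the filter is misconfigured.
export const handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    if (key.startsWith('output/')) continue; // never recurse on our own writes
    await processImage(bucket, key);         // hypothetical app-specific work
  }
};
```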
To avoid an expensive lesson, I focused on minimizing or eliminating as many attack vectors as I could think of (a list that is by no means comprehensive - this is not an invitation to find more). I wasn't afraid to kill a fly with a sledgehammer here, so my API Gateway is strictly rate-limited with a usage plan. This is obviously not ideal or realistic for anything but a demo app, but it does what I need it to do. Real applications require more robust DDoS protection, such as AWS WAF in front of API Gateway or a similar solution.
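The project declares the usage plan in CloudFormation, but for illustration, the same throttle-and-quota shape expressed through the JS SDK looks roughly like this (all names, values, and IDs are made up):

```js
import { APIGatewayClient, CreateUsagePlanCommand } from '@aws-sdk/client-api-gateway';

const apigw = new APIGatewayClient({});

// Sketch: the throttle/quota shape of a usage plan. Illustrative only -
// the real project declares this in CloudFormation, not via the SDK.
await apigw.send(new CreateUsagePlanCommand({
  name: 'demo-plan',
  throttle: { rateLimit: 5, burstLimit: 10 },      // steady req/s and burst cap
  quota: { limit: 1000, period: 'DAY' },           // hard daily request ceiling
  apiStages: [{ apiId: 'abc123', stage: 'prod' }], // hypothetical API id/stage
}));
```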
A Cognito authorizer is used to authenticate all requests other than signups. For signup itself, I attempted to implement AWS SES, but ultimately chose to keep things simple and stuck with the stock verification emails sent by Cognito (which are capped at 50 per day, so a rate limit comes built in). S3 storage is limited per user (20 MB maximum file size, 50 MB total storage). Uploads and downloads are done via short-duration signed URLs to prevent abuse. Signed URLs also conveniently bypass the more complicated process of passing large amounts of data through API Gateway and Lambda, both of which impose their own payload size limits.
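Issuing a signed upload URL with the V3 JS SDK is only a few lines. A sketch (the bucket name and key scheme are illustrative, and the per-user quota check would happen before this):

```js
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3 = new S3Client({});

// Sketch: issue a short-lived upload URL for a user's image.
export async function getUploadUrl(userId, fileName) {
  const command = new PutObjectCommand({
    Bucket: 'bookfinder-uploads',   // hypothetical bucket name
    Key: `${userId}/${fileName}`,   // hypothetical key scheme
  });
  return getSignedUrl(s3, command, { expiresIn: 60 }); // valid for 60 seconds
}
```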
IMAGE PROCESSING
This is the main part of the project that my dad contributed to, aside from the Makefile and some pair programming. He worked on graphics for many years, so we brainstormed for a while and he wrote it. When a user uploads an image, S3 triggers a Lambda function that processes the image. The function feeds the image to Rekognition, which returns as many words as it can find in a single pass. This is limited to around 50 words - in hindsight, the more lightweight Textract probably would have sufficed for this project, but Rekognition does really well with odd fonts and perspectives.
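A single detection pass is a straightforward Rekognition call. A sketch using the V3 JS SDK (bucket and key are illustrative; Rekognition returns both LINE and WORD detections, and only the words matter here):

```js
import { RekognitionClient, DetectTextCommand } from '@aws-sdk/client-rekognition';

const rekognition = new RekognitionClient({});

// Sketch: one detection pass over an image already sitting in S3.
async function detectWords(bucket, key) {
  const { TextDetections } = await rekognition.send(new DetectTextCommand({
    Image: { S3Object: { Bucket: bucket, Name: key } },
  }));
  return (TextDetections ?? []).filter((d) => d.Type === 'WORD');
}
```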
The function then takes the returned words' polygons and overlays them on the original image, producing a “whiteout image”. This is done with a Node.js HTML-canvas Lambda layer that my dad found online (the layer pins us to an older runtime, which is why I can't update Lambda to the most recent Node.js version - the one that comes pre-loaded with the V3 JS SDK). These images are stored temporarily in a separate bucket during processing and are automatically deleted afterward.
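A sketch of the whiteout step, assuming the node-canvas API that such a layer typically provides (Rekognition's polygon points are ratios of the image dimensions):

```js
import { createCanvas, loadImage } from 'canvas'; // node-canvas, from the layer

// Sketch: paint white polygons over every detected word so the next pass
// only finds what was missed.
async function whiteout(imageBuffer, words) {
  const image = await loadImage(imageBuffer);
  const canvas = createCanvas(image.width, image.height);
  const ctx = canvas.getContext('2d');
  ctx.drawImage(image, 0, 0);
  ctx.fillStyle = '#ffffff';
  for (const word of words) {
    ctx.beginPath();
    word.Geometry.Polygon.forEach((p, i) => {
      const x = p.X * image.width;  // polygon points are 0..1 ratios
      const y = p.Y * image.height;
      i === 0 ? ctx.moveTo(x, y) : ctx.lineTo(x, y);
    });
    ctx.closePath();
    ctx.fill();
  }
  return canvas.toBuffer('image/png');
}
```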
The whiteout image is then submitted to Rekognition, and this process repeats until no new results are returned. It can take over a minute for images with a lot of text, but for images like the ones I used for the demo, it takes around 10-15 seconds depending on Lambda cold start. Surfacing that processing time to the user is one of the shortcomings of the project. Ideally, the Library page would have a webhook listening for completion - definitely something for a future version. Once processing is finished, the Lambda uploads the data to DynamoDB, which can then be accessed via the Library page.
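Putting the pieces together, the repeat-until-empty loop and the final DynamoDB write look roughly like this (the table name and helper functions are hypothetical - downloadImage would be an S3 GetObject, detectWordsFromBytes a DetectText call on raw bytes, and whiteout is the sketch above):

```js
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Sketch: repeat detection until Rekognition finds nothing new, then store
// the accumulated words and their polygons for the Library page to query.
async function processImage(bucket, key) {
  let buffer = await downloadImage(bucket, key);       // hypothetical helper
  const allWords = [];
  while (true) {
    const words = await detectWordsFromBytes(buffer);  // hypothetical helper
    if (words.length === 0) break;
    allWords.push(...words);
    buffer = await whiteout(buffer, words);            // sketched above
  }
  await ddb.send(new PutCommand({
    TableName: 'BookFinderImages', // hypothetical table name
    Item: {
      imageKey: key,
      words: allWords.map((w) => ({
        text: w.DetectedText,
        polygon: w.Geometry.Polygon,
      })),
    },
  }));
}
```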
THE FUTURE
I'd like to continue to use this project to expand my skills. Here's a list of potential changes I've come up with so far:
- Create mobile version
- Use SES to send user emails
- Allow zoom in/out on Library images
- Improve search result visibility
- Allow user to delete account
- Display a list of all detected words for each image
- Improve search functionality (multi-word, best fit, etc.)
- Improve upload UI
- Add webhook for image processing completion to Library page
- Rebuild with React and host on AWS Amplify
- Rebrand to reflect more generic OCR usage
- Better error handling
- Add password reset option
Thank you for reading! If you have any questions, feel free to contact me or submit an issue.