Building a mobile app API using Drupal, Node.js and MongoDB
In February 2012 our team at ConsumerSearch launched the ConsumerSearch Reviews iOS app. This handsome app helps you make purchase decisions by providing extensive product reviews, comparisons, pricing information, and online and local store availability. Try it out during your next purchase -- you'll be surprised how much time you save researching the best products.
You can search for products by keyword or with the barcode scanner. We are very proud of our first app, which has already gone through several updates and received positive user reviews.
For a development team that had historically focused on PHP and Drupal development, this project introduced big technology changes that took us out of our comfort zone.
Of course we all know about the latest technologies; however, writing production-ready code in a short amount of time is always a risk and a challenge (one that we welcome :)).
I feel that many (Drupal) development teams are facing similar questions when moving to new technologies, so I hope this use case can give developers familiar with PHP and Drupal some quick guidance or fresh ideas. Feel free to ask me questions or provide feedback in the comments section.
ConsumerSearch is powered by Drupal
It's no secret that ConsumerSearch has been one of the earliest examples of highly trafficked Drupal sites since it switched to Drupal in 2008.
We manage to serve our huge traffic load by relying HEAVILY on caching (browser, Drupal, memcache, and Akamai).
This reliance on heavy caching in order to deliver super fast performance is one of the weaker points of our (and many other) websites, especially in a world where we aim for real time and personalized experiences.
While creating the architecture and technical specification for our API backend we realized that some drastic changes had to be introduced in order to create a highly performant and flexible API that could serve as a data source for the mobile application and for additional users.
We eventually settled on the following technology:
- Drupal as our CMS which handles all the content creation. Drupal pushes updated and new content to MongoDB and Solr.
- MongoDB as our intermediate storage layer to where we push all Drupal content in a heavily denormalized way. The API application layer queries MongoDB directly.
- Acquia Search powered by Solr as the search index. This powers search related API calls.
- Node.js as the API layer. We use Pronto.js for our lightweight, high-performance node.js framework.
This gives us a nice open source stack that plays very well together in creating a flexible and highly performant content delivery system.
As a side note, our team had little to no experience with Node.js and MongoDB before working on the iOS app. Even today these technologies account for only a very small part of our day-to-day business. However, both proved to be relatively easy to learn and master.
Speeding up and choosing the right technologies
We've been using Drupal 6 for the past couple of years and benefitted greatly from the flexibility of its content model (CCK), its presentation possibilities (Views / templating engine), and the huge number of contributed modules available (too many to name). However, where these tools gave us incredible speed during the initial construction of the site, we have gradually experienced some of the unfortunate drawbacks of Drupal (which are also present in many other content management systems). Some of these drawbacks include:
- Speed and performance: We need to rely heavily on caching and scale out our web server infrastructure in order to provide speedy pages. Part of the speed problem is the heavy customization of our website (our fault), but also the nature of the Drupal architecture.
- Module compatibility: As an example, we have a robust deploy process that depends on the deploy module. However, this module is not compatible with the new and recommended Services 3.x module.
- Module scope: Very often, Drupal modules try to do too much. I have also been guilty of this while, for example, working on the first versions of the Mobile Tools module. If you want fast and lightweight, you want focused modules. Often you have to write the functionality yourself or find a different module/solution.
Eventually our research on using Drupal as a fast and lightweight API layer came down to the fact that Drupal was just not lightweight, fast or simple enough. It can, of course, be tuned so that it is some or even all of those things, but why choose the hard way when other technologies can give you those features out of the box? We value simplicity and focus within our development team and projects, but Drupal, unfortunately, is just not always simple or focused.
With speed, flexibility, and simplicity as our goals we made the following assumptions:
- Storing our data in a denormalized fashion should speed up the retrieval process.
- Storing the data schema-less should result in shorter development time.
- Using very dedicated code should make the application lightweight and fast.
Further, the API would have the following characteristics:
- The database should store simple documents with no revisioning (Drupal keeps track of that) since our content doesn't frequently change.
- The API should focus on getting data from the database -- no write operations needed.
- The data should be replicated across multiple data centers.
- The architecture should easily scale out across multiple data centers.
Denormalization of the data
Loading content stored in Drupal can be very DB intensive due to the underlying content storage model. Data is scattered in many tables, and when you rely on node references you end up making a lot of DB calls. So we decided to preprocess our data and store complete data objects in a denormalized way so retrieval becomes easy and fast.
For this we wrote a custom module, loosely based on the Apache Solr Drupal module, that publishes some of our content types in a denormalized form to MongoDB. Some of the transformations we performed include:
- removing fields that were only used internally or specific to Drupal.
- giving fields Drupal-agnostic names.
- collapsing some of the CCK arrays (e.g. "field_name[value]" into "field_name")
- other content tweaks that made the resulting data object easy to understand.
We ended up with a series of collections for our primary objects: reports, products, channels, authors, etc.
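To make the flattening concrete, here is a minimal sketch of the kind of transformation described above. The node structure and field names are illustrative, not our real schema, and the helper function is hypothetical:

```javascript
// A CCK-style node roughly as Drupal might serialize it.
// Field names here are made up for illustration.
const drupalNode = {
  nid: '1234',
  vid: '5678',                       // internal revision id -- dropped below
  type: 'product',
  title: 'Example Camera',
  field_brand: [{ value: 'Acme' }],  // CCK wraps values in [{ value: ... }]
  field_price: [{ value: '199.99' }]
};

// Collapse the CCK arrays, give fields Drupal-agnostic names,
// and strip Drupal-internal fields.
function denormalize(node) {
  const doc = { _id: node.nid, type: node.type, title: node.title };
  for (const key of Object.keys(node)) {
    if (key.startsWith('field_')) {
      // "field_brand[0].value" becomes plain "brand"
      doc[key.replace('field_', '')] = node[key][0].value;
    }
  }
  return doc;
}

const doc = denormalize(drupalNode);
// doc: { _id: '1234', type: 'product', title: 'Example Camera',
//        brand: 'Acme', price: '199.99' }
```

The resulting document is self-contained, so a single MongoDB read replaces the many table joins and node-reference loads Drupal would otherwise perform.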
MongoDB as a document store
It did not take much research to determine that MongoDB was an ideal choice as the database to store our denormalized data. Some of our key reasons for choosing MongoDB:
- JSON is the native format for storing and retrieving data, and it is currently the de facto standard for web APIs.
- as a document store, it is natural for MongoDB to store denormalized nodes.
- querying MongoDB is easy and intuitive. Although different than SQL it is very easy to dynamically create queries.
- MongoDB has good driver support in many different languages.
- it provides easy replication using replica sets, resulting in read scalability and failover.
- reads and writes are scalable through Sharding.
- it's lightning fast
- and on and on.
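On the "easy and intuitive querying" point: a MongoDB query is just a plain object, so it can be assembled piece by piece instead of concatenating SQL strings. A small sketch, with made-up field names:

```javascript
// Build a MongoDB query object dynamically from optional filters.
// Field names ("category", "price", "title") are illustrative.
function buildProductQuery({ category, maxPrice, text } = {}) {
  const query = {};
  if (category) query.category = category;                      // exact match
  if (maxPrice !== undefined) query.price = { $lte: maxPrice }; // range operator
  if (text) query.title = { $regex: text, $options: 'i' };      // case-insensitive match
  return query;
}

const query = buildProductQuery({ category: 'cameras', maxPrice: 300 });
// query: { category: 'cameras', price: { $lte: 300 } }

// With the node driver, the object would then be handed to find(), roughly:
//   db.collection('products').find(query).toArray(callback);
```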
While researching other storage databases, we learned that MongoDB isn't perfect at everything, but it is more than good enough for our needs. Some of its missing features include:
- key-value stores like Cassandra and Redis can reach much higher speeds.
- no master-master replication, which in our case would have meant the possibility of two active master data centers.
- no document versioning like in CouchDB.
For each content type that we needed access to in our API, we created a MongoDB collection. We also set up a replica set across our two data centers, with an additional arbiter in a third data center.
The replica set lets us spread read operations across all the machines in the set, while writes still go to the primary database. To scale writes, sharding is the recommended technique.
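A rough sketch of how such a setup looks from the application side. The hostnames, database name, and replica set name below are invented for illustration; the read preference is what tells the driver it may read from secondaries:

```javascript
// Two data-bearing members, one per data center (hostnames are made up).
const hosts = [
  'mongo1.dc1.example.com:27017',
  'mongo2.dc2.example.com:27017'
];

// The arbiter in the third data center only votes in elections;
// it holds no data, so it is not listed as a query target here.
const uri = 'mongodb://' + hosts.join(',') +
  '/reviews?replicaSet=rs0&readPreference=secondaryPreferred';

// With the official node driver this would be used roughly as:
//   const { MongoClient } = require('mongodb');
//   MongoClient.connect(uri, (err, client) => { /* ... */ });
```

With `secondaryPreferred`, reads go to secondaries when available, so each data center can serve reads locally while all writes route to the primary.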
Node.js as the API application layer
While it was clear that Drupal would never give us the required speed for our API (unless we relied 100% on cached responses) and that MongoDB was an easily scalable solution for storing our data, we debated heavily when choosing the application layer for our API. Would we stick with our familiar PHP environment, or would we explore other options (Ruby, Node.js, etc.)?
The main high-level requirements for the API in our case were:
- simple routing of API requests; no templating, CMS functionality, or heavy administration.
- the ability to handle many I/O operations, including connecting to MongoDB and various third-party data providers.
- the need to minimize dependencies on the content model in Drupal. For example, adding a field to a Drupal node should not result in changes to the API.
- the ability to handle JSON well.
- it needed to be as fast as possible.
- it must be easy to learn, or at least minimize the learning curve for the team.
We narrowed the list down to PHP (we all know that one very well) and Node.js (new kid on the block, we heard many great things about it, etc.).
Much of the Node.js-related discussion in our team (and elsewhere) focused on the "blocking" versus "non-blocking" nature of Node.js and on its internals. I imagine this discussion is one of the most popular within teams that are new to Node.js. I'll try to summarize it using the following examples:
1. Node.js
Node.js uses an event-driven, non-blocking I/O model. This means that I/O-bound operations such as external connections, database access, and file operations do not block the execution of the main thread. Instead they are executed by a low-level Node.js thread pool and, when completed, are queued back to the main thread for further handling.
On top of that, Node.js by default runs in a single thread on the server that sequentially handles all incoming requests, returned callbacks, and non-I/O processing logic. An example would be an algorithmically complex calculation (like computing a Fibonacci number).
In the following diagram you can see what is happening in a typical Node.js application where:
- there is one main thread
- there is a low-level (and very lightweight, low-overhead) thread pool that handles the non-blocking I/O operations.
- incoming requests and finished I/O calls are put on the event queue and await processing by the Node.js main thread.
As a result, long-running processing in the main thread can block your entire application (which is very bad). Luckily, workarounds are available to move complex operations out of the main execution thread.
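The single-thread behavior described above is easy to demonstrate. In this sketch, a scheduled callback (standing in for a completed I/O operation) cannot run until the CPU-bound Fibonacci calculation on the main thread finishes:

```javascript
const order = [];

// Schedule a callback, as the runtime does when an I/O operation completes.
// It sits on the event queue until the main thread is free.
setImmediate(() => order.push('callback'));

// A naive Fibonacci: CPU-bound work that runs entirely on the main thread.
function fib(n) {
  return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

const result = fib(20); // 6765 -- while this runs, no callbacks can fire
order.push('sync work done');

// At this point only the synchronous work has executed;
// the setImmediate callback is still waiting on the event queue.
// order: ['sync work done']
```

This is exactly why a heavy calculation in a request handler stalls every other pending request in a Node.js process.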
2. Apache with PHP (Prefork)
In Apache, when using mod_php with the Prefork MPM (because PHP is not thread safe), every incoming request is handled by its own separate process. This means that if one process hangs, it will not block the others. This is one of the strong benefits of using PHP with Apache.
However, I/O-bound operations are blocking: execution waits until the I/O operation has finished (see dashed lines). This waiting ties up system resources that sit largely idle. Spinning up new processes also adds overhead on your system.
In a nutshell, you could say that Apache with PHP could be seen as a safe bet, but not the most performant one.
In the case of the API, we did not have complex calculations and the main part of the execution consisted of connecting to MongoDB for authentication and getting data back. Node.js was clearly a winner.
For the API we settled on Pronto.js, a great and highly performant Node.js framework developed by long-time Drupal developer Matt Butcher while he was a member of our team at ConsumerSearch.
Pronto.js is in many ways similar to the well-known Connect framework. It provides an easy interface for creating chainable commands that execute on a request.
Many tricks are used to minimize the blocking nature of incoming requests. We've since made enhancements to Pronto.js that provide non-blocking execution. I'll cover that in a subsequent blog post if it warrants discussion.
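To give a feel for the chainable-command style, here is a minimal sketch of the idea Pronto.js shares with Connect. This is not Pronto.js's actual API; the chain runner and the command names are invented for illustration:

```javascript
// A tiny chain runner: each command does its piece of work on a shared
// context, then calls next() to hand off to the following command.
function chain(...commands) {
  return function run(context) {
    let i = 0;
    function next() {
      const command = commands[i++];
      if (command) command(context, next);
    }
    next();
    return context;
  };
}

// Commands for a hypothetical /products request.
const handleRequest = chain(
  (cxt, next) => { cxt.authenticated = true; next(); },               // auth check
  (cxt, next) => { cxt.data = { title: 'Example Camera' }; next(); }, // fetch data
  (cxt, next) => { cxt.body = JSON.stringify(cxt.data); next(); }     // serialize
);

const cxt = handleRequest({ url: '/products/123' });
// cxt.body: '{"title":"Example Camera"}'
```

Because each command only runs when the previous one calls `next()`, a command that kicks off an I/O operation can defer `next()` to its callback, keeping the whole chain non-blocking.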
Wrapping it all up
We have been running this setup now for almost 5 months with no problems. Development was fast and lightweight.
We experience super fast response times and very low server load. Benchmarks show that we can easily sustain 1000 concurrent connections serving 4000 requests/second on two mid-range servers. We actually need to scale up some of our testing infrastructure to explore the real limits :) Try our free app and see for yourself.
Further optimizations we are exploring include:
- using Cluster in order to span multiple Node.js threads.
- caching responses (something we still want to avoid as much as possible).
- integrating Mongoose to get some minimal schema definition.
- sharding of our MongoDB server.
- compressing the JSON responses.
- adding more unit tests.
Please comment to share your experiences with these technologies or if you have any questions or comments. You can also follow me on Twitter: @twom.
Special thanks go out to Treehouse Mobile for collaborating with us on the iPhone app, and to our development team for suggestions and corrections on this blog post!