Avi Bryant Beyond Relational DB's Scaling is easy adding more servers. Easy for web, but less best practice in database. For DB, easy: add users but they talk to their own DB server. Hard if they have to cross communicate (talk to all data, central or otherwise). There are points and counterpoints. Google & GFS + Big Table. Conversely FaceBook runs on MySQL (thousands, or tens of thousands of instances). Basically they're building a distributed database using MySQL. They're probably not using ORM (though Avi doesn't really know). Can also use other services (AppEngine, SimpleDB, SSDS (Micrsoft, though Avi says that it is worth taking a look at.)) Common themes: restricted feature set compared to relational database. * No joins * No grouping * No aggregation * Restricted sorting * Restricted CPU (terminate slow queries rather than continue to run) * Dynamic schema (very interesting): An individual row can have data another row doesn't (similar to SqLite typing (lax typing or whatever it's called)) Not disbled because hard to implement, but hard to iplement at scale. Typical DB: query planner to "compile" query. Some parts are fast (index) some are slow (linear scan). At scale linear scan doesn't make sense (e.g. distributed across machines). Some interesting restrictions. In GQL (Google Query Language): if using an inequality, must also sort by that column and it must be the FIRST sort property. Makes sense since the database has to sort anyway. Interesting trade-off between performance (scale) and developer pain. Annoyance: breaks abstraction & encapsulation, devloper must understand something about storage (leaky abstraction). SimpleDB: another restriction always doing text sorting on query results. Results are just item name. This is useful for concurrency: you can get a result set and then do something concurrency (MapReduce, etc.), see below. Tips: * Concurrency is necessary on large datasets & these stores are designed for that * Need good way to group/sort in memory locally (both good performance and good interface). E.g. proble with our internal Dataset performance, acts as bottle-neck. Not trivial. * Compute on updates not in queries: calculate & aggregate Question: what about caching? Can you cache more since you're calculating on store? Avi thinks maybe less: these engines are set-up for fast, custom queries, so a new query should be fast. Caching also has classical problems, e.g. SimpleDB only has evenntual consistency, it might be minutes before your write appears (though I think from reading the paper, a client favors the same server, so should reflect the write). Eventual consistency can produce version conflicts (I believe that SimpleDB uses versioning & optimistic locking). Avi says that this is something that makes him nervous about SimpleDB. Question: what are the scenarios for use? This will be the focus of the second half of the talk. Avi thinks that in MOST cases, startups should NOT need this: they're too small, or their data is in silos: BaseCamp, FreshBook, etc. An exception is Social networks: the point is to allow cross-connections. Actually, anything that leverages network effect. Tying into previous question: architect your solution to HAVE data in silos, at least for fast operations. That allows so much flexibility (e.g. add more RAM & load all cusotmers data in memory). This requires session affinity (sticky), which isn't hard at all (load balancers have it built in). Still, all the customer's load has to be able to fit on on server, so there is a limit to the size of a single customer. Must remember to write to disk at end (different than DB). "Persistance" pattern (or whatever), e.g. Prevayler. * Every change is Command object (command of change) * Serialize the command object to a transaction log * Checkpoint occasionally * If necessary (crash) can replay transaction log Basically journalled database. Problems: schema migration relies on object versiong (language dependant). Avi thinks that JotSpot is using Prevayler. Tips: GOOD fulltext index: just do broad search and then linear search through smaller results in RAM. This is the approach that DabbleDB uses. As an anecdote, DabbleDB took a weekend to port the data to use SSDS (Microsoft's distributed database) which was easy because they were already going lots in memory and reading from local files. This provides a migration path: start of with Prevayler, text read, etc. and then migrate to a distributed backend (SimpleDB, etc.) You don't need a database in cases where there are lots of users reading the same thing but not on a huge dataset: e.g. Tech.memorandum only requres about 600 megs of core (which is nothing). Also makes writes easy since everything is in memory. ITA/Orbitz also did the same thing: 2 gigs of flights, fares, etc. in memory. Take-away: the cost of RAM has gone down, don't rely on old thinking. Problems with RAM are: transactions, load balancing, and the tipping-point when Data size > RAM size. Transactions can be avoided if you've architected something into silos with few users (e.g. individual DabbleDB database has average of 5 users), Then all you need is standard in memory concurrency: mutexes, queues, etc. Serialize results. Load balancing is solved by session affinity. Large datasets can be solved by getting more RAM or selectively choosing your customers. Demo: MagLev. MagLev is a Ruby VM implemented on GemStone (SmallTalk company) that's designed to be scalable. Includes IRB type REPL. Multiple instanees PRODUCE SHARED STATE!!! You can open a new maglev and use a predefined method, etc. Cache is in memory, so fast. (Only cache global & everything visible from globals, e.g. heap. Not locals. Also transactional (with "BEGIN"/"COMMIT" keywords). Also has commit conflicts, etc. Supports "ABORT" (rollback). Things are also persisted to machine. Garbage collection is also seperate: for instance to collect orphaned things in memory, and for database itself. Difference from Prevayler: don't have to have EVERYTHING in RAM, but can lazily load only what's being used by VM's. This is NOT a horizontal scaling replacement (SimpleDB, etc.), but it is a good replacement for the in memory case: handles transactions, etc. easily. Also solves load balancing: don't need load balancing, since can access the shared state. But requires VM architecture since needs to understand what each VM is doing, so hard to do as a library. System administration: lots of daemons. Control writing things to central memory cache, write to disk, garbage collection, etc. Implementation of guts: Gemstone wrote a VM that Ruby has been adapted to run-on. This is a much different deployment than typical Ruby's apps. (quick syntax example of IRB) >> BEGIN *> X = 1 => 1 *> OMMIT {:commitResult=>:success} Note use of "*>" prompt change. Back to original topic: doesn't doing any of this lose lots of the advantage of databases? They're already tuned for queries and caches, etc. OTOH, they're a different programming model with their own overhead. If you can fit all data in memory, you're still inside you're normal environment with all the advtanges. Avi almost always prefers this approach. There is a sweet spot for a relational database: Dataset is large but won't grow too large (2gigs < x < 10 gigs). This is uncommon for web stuff, but common in enterprise. Also has versioning, legacy, migration advantages: can use database as synchronizaiton point for architecture, across teams, across languages, etc. Some of these can be worked around by synchronizing using languages, etc. Data set size can also be used by segmenting size of client working-set (such as Avi's example in DabbleDB). Also depends on tool support for going from core to/from disk. E.g. SmallTalk is based on snapshotting everything to disk, so it's very fast. (Aside: by comparison a several hundred meg Dataset used >2 gigs of RAM.) In some way is this reimplementing virtual memory? Business point of view: problems with any of these approaches is lock-in, support. E.g. web hosting is based on commodity model: just migrate your entire site. Porting from SimpleDB to SSDS is hard, porting to AppEngine is harder if you're app is not in Python. So you want to be careful who you lock yourself too. Aside: how could something like MagLev be replicated in Python? Pickling? But that has limitations. Gemstone also has lots of tweaks and optimizations (ability to add indices, etc.) that could be exposed to users. Question: the lack of fancy aggregation could result in the transfer of very large datasets across the network. So you're trading query complexity for network bandwidth. Are there any tricks for optimizing? More work on write is one: calculate things you expect to have to fetch on write so there won't be the need to fetch large sets of objects. Question: what sort of library support is there for the big services (AppEngine, SimpleDB, etc.). AppEngine requires the use of Google's Python VM, so it has it's own stuff built in. Not sure about SimpleDB. In general, need to check these thnings out on your own.