
Solr 4
The NoSQL Search Server
Yonik Seeley
May 30, 2013
NoSQL Databases
• Wikipedia says:
A NoSQL database provides a mechanism for storage and retrieval of data that
use looser consistency models than traditional relational databases in order to
achieve horizontal scaling and higher availability. Some authors refer to them as
"Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query
language to be used.
• Non-traditional data stores
• Doesn’t use / isn’t designed around SQL
• May not give full ACID guarantees
• Offers other advantages such as greater scalability as a
tradeoff
• Distributed, fault-tolerant architecture
Solr Cloud Design Goals
• Automatic Distributed Indexing
• HA for Writes
• Durable Writes
• Near Real-time Search
• Real-time get
• Optimistic Concurrency
Solr Cloud
• Distributed Indexing designed from the ground up to
accommodate desired features
• CAP Theorem
• Consistency, Availability, Partition Tolerance (saying goes “choose 2”)
• Reality: Must handle P – the real choice is tradeoffs between C and A
• Ended up with a CP system (roughly)
• Value Consistency over Availability
• Eventual consistency is incompatible with optimistic concurrency
• Closest to MongoDB in architecture
• We still do well with Availability
• All N replicas of a shard must go down before we lose writability for that
shard
• For a network partition, the “big” partition remains active (i.e. Availability
isn’t “on” or “off”)
Solr 4
Solr 4 at a glance
• Document Oriented NoSQL Search Server
• Data-format agnostic (JSON, XML, CSV, binary)
• Schema-less options (more coming soon)
• Distributed
• Multi-tenanted
• Fault Tolerant
• HA + No single points of failure
• Atomic Updates
• Optimistic Concurrency
• Near Real-time Search
(The desire for these features drove some of the "SolrCloud" architecture)
• Full-Text search + Hit Highlighting
• Tons of specialized queries: Faceted search, grouping,
pseudo-join, spatial search, functions
Quick Start
1. Unzip the binary distribution (.ZIP file)
   Note: no "installation" required
2. Start Solr
   $ cd example
   $ java -jar start.jar
3. Go!
   Browse to http://localhost:8983/solr for the new admin interface
New admin UI
[Screenshot of the new admin UI]
Add and Retrieve document
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
 { "id"     : "book1",
   "title"  : "American Gods",
   "author" : "Neil Gaiman"
 }
]'

$ curl http://localhost:8983/solr/get?id=book1
{
  "doc": {
    "id"        : "book1",
    "author"    : "Neil Gaiman",
    "title"     : "American Gods",
    "_version_" : 1410390803582287872
  }
}

Note: no type of "commit" is necessary to retrieve documents via /get (real-time get)
Simplified JSON Delete Syntax
• Single delete-by-id
{"delete":"book1"}
• Multiple delete-by-id
{"delete":["book1","book2","book3"]}
• Delete with optimistic concurrency
{"delete":{"id":"book1", "_version_":123456789}}
• Delete by Query
{"delete":{"query":"tag:category1"}}
Atomic Updates
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
 {"id"        : "book1",
  "pubyear_i" : { "add" : 2001 },
  "ISBN_s"    : { "add" : "0-380-97365-1"}
 }
]'

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
 {"id"       : "book1",
  "copies_i" : { "inc" : 1},
  "cat"      : { "add" : "fantasy"},
  "ISBN_s"   : { "set" : "0-380-97365-0"},
  "remove_s" : { "set" : null }
 }
]'
Optimistic Concurrency
• Conditional update based on document version
1. Client does a /get of the document from Solr
2. Client modifies the document, retaining its _version_
3. Client sends the resulting document to /update
4. If the update fails with code=409, go back to step #1
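A minimal sketch of this loop in shell (the jq tool and the specific field change are illustrative assumptions, not part of Solr):

while true; do
  # 1. /get the current document, including its _version_
  doc=$(curl -s 'http://localhost:8983/solr/get?id=book1' | jq '.doc')
  # 2. modify it, retaining _version_
  updated=$(echo "$doc" | jq '.author = "Neil Gaiman"')
  # 3. resubmit, capturing only the HTTP status code
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    http://localhost:8983/solr/update \
    -H 'Content-type:application/json' -d "[$updated]")
  # 4. 409 Conflict means a concurrent update won; re-read and retry
  [ "$status" != "409" ] && break
done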
Version semantics
• Specifying _version_ on any update invokes optimistic concurrency

  _version_ | Update Semantics
  ----------+--------------------------------------------------------
  > 1       | Document version must exactly match supplied _version_
  1         | Document must exist
  < 0       | Document must not exist
  0         | Don't care (normal overwrite if exists)
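For example (a hedged sketch; "book3" and its fields are invented):

# Create book3 only if it does not already exist (_version_ < 0)
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[{"id":"book3", "title":"Zero History", "_version_":-1}]'

# Update book3 only if it already exists (_version_ = 1)
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[{"id":"book3", "author":"William Gibson", "_version_":1}]'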
Optimistic Concurrency Example
Get the document:

$ curl http://localhost:8983/solr/get?id=book2
{ "doc" : {
    "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":7,
    "copiesOut_i":3,
    "_version_":123456789 }}

Modify and resubmit, using the same _version_:

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
  { "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":6,
    "copiesOut_i":4,
    "_version_":123456789 }
]'

Alternately, specify the _version_ as a request parameter:

$ curl http://localhost:8983/solr/update?_version_=123456789 -H 'Content-type:application/json' -d [...]
Optimistic Concurrency Errors
• HTTP Code 409 (Conflict) returned on version mismatch
$ curl -i http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[{"id":"book1", "author":"Mr Bean", "_version_":54321}]'
HTTP/1.1 409 Conflict
Content-Type: text/plain;charset=UTF-8
Transfer-Encoding: chunked
{
"responseHeader":{
"status":409,
"QTime":1},
"error":{
"msg":"version conflict for book1 expected=12345
actual=1408814192853516288",
"code":409}}
Schema
Schema REST API
• Restlet is now integrated with Solr
• Get a specific field
curl http://localhost:8983/solr/schema/fields/price
{"field":{
"name":"price",
"type":"float",
"indexed":true,
"stored":true }}
• Get all fields
curl http://localhost:8983/solr/schema/fields
• Get Entire Schema!
curl http://localhost:8983/solr/schema
Dynamic Schema
• Add a new field (Solr 4.4)
curl -XPUT http://localhost:8983/solr/schema/fields/strength -d '
{"type":"float", "indexed":"true"}
'
• Works in distributed (cloud) mode too!
• Schema must be managed & mutable (not currently the default)
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Schemaless
• "Schemaless" normally means that the client(s) have an implicit schema
• "No schema" is impossible for anything based on Lucene
• A field must be indexed the same way across documents
• Dynamic fields: convention over configuration
• Only pre-define types of fields, not fields themselves
• No guessing: any field name ending in _i is an integer (see the sketch at the end of this slide)
• “Guessed Schema” or “Type Guessing”
• For previously unknown fields, guess using JSON type as a hint
• Coming soon (4.4?) based on the Dynamic Schema work
• Many disadvantages to guessing
• Lose ability to catch field naming errors
• Can’t optimize based on types
• Guessing incorrectly means having to start over
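A sketch of the dynamic-field convention above (field names invented; assumes the stock example schema, where *_i maps to integer and *_s to string):

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[{"id":"book1", "pages_i":465, "publisher_s":"William Morrow"}]'

No schema change is needed: the _i and _s suffixes alone determine how the fields are indexed.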
Solr Cloud
Solr Cloud

[Diagram: a query (http://.../solr/collection1/query?q=awesome) may arrive at any node; it is fanned out as load-balanced sub-requests to one replica of each of shard1 and shard2 (replica1, replica2, replica3 per shard). A quorum of ZooKeeper nodes holds the cluster state:]

/collections
  /collection1
    configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
/livenodes
  server1:8983/solr
  server2:8983/solr
/configs
  /myconf
    solrconfig.xml
    schema.xml
/clusterstate.json
/aliases.json

ZooKeeper holds cluster state
• Nodes in the cluster
• Collections in the cluster
• Schema & config for each collection
• Shards in each collection
• Replicas in each shard
• Collection aliases
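As a sketch of how a config set lands under /configs (paths and the ZK port assume the stock cloud example, where the first node embeds ZooKeeper on port 9983), Solr's bundled zkcli.sh script can upload one:

$ example/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
    -cmd upconfig -confdir example/solr/collection1/conf -confname myconf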
Distributed Indexing
http://.../solr/collection1/update

• Update sent to any node
• Solr determines what shard the document is on, and forwards it to the shard leader
• Shard leader versions the document and forwards it to all other shard replicas
• HA for updates (if one leader fails, another takes its place)
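For instance (a sketch; port 7574 assumes the second node of the stock two-node cloud example, and the document is invented), the same add can be sent to any node:

$ curl http://localhost:7574/solr/collection1/update -H 'Content-type:application/json' -d '
[{"id":"book7", "title":"The Ocean at the End of the Lane"}]'

Solr hashes the id, finds the owning shard, and forwards the document to that shard's leader.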
Collections API
• Create a new document collection
  http://localhost:8983/solr/admin/collections?action=CREATE
    &name=mycollection
    &numShards=4
    &replicationFactor=3
• Delete a collection
  http://localhost:8983/solr/admin/collections?action=DELETE
    &name=mycollection
• Create an alias to a collection (or a group of collections)
  http://localhost:8983/solr/admin/collections?action=CREATEALIAS
    &name=tri_state
    &collections=NY,NJ,CT
http://localhost:8983/solr/#/~cloud
Distributed Query Requests
• Distributed query across all shards in the collection
  http://localhost:8983/solr/collection1/query?q=foo
• Explicitly specify node addresses to load-balance across
  shards=localhost:8983/solr|localhost:8900/solr,
         localhost:7574/solr|localhost:7500/solr
  • A list of equivalent nodes is separated by "|"
  • Different phases of the same distributed request use the same node
• Specify logical shards to search across
  shards=NY,NJ,CT
• Specify multiple collections to search across
  collection=collection1,collection2
• public CloudSolrServer(String zkHost)
  • ZK-aware SolrJ Java client that load-balances across all nodes in the cluster
  • Calculates where a document belongs and sends it directly to the shard leader (new)
Durable Writes
• Lucene flushes writes to disk on a “commit”
• Uncommitted docs are lost on a crash (at lucene level)
• Solr 4 maintains its own transaction log
• Contains uncommitted documents
• Services real-time get requests
• Recovery (log replay on restart)
• Supports distributed “peer sync”
• Writes forwarded to multiple shard replicas
• A replica can go away forever w/o collection data loss
• A replica can do a fast “peer sync” if it’s only slightly out of date
• A replica can do a full index replication (copy) from a peer
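The transaction log is enabled in solrconfig.xml; the stock example config does it roughly like this:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- the log that backs durability, real-time get, recovery, and peer sync -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>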
Near Real Time (NRT) softCommit
• softCommit opens a new view of the index without
flushing + fsyncing files to disk
• Decouples update visibility from update durability
• commitWithin now implies a soft commit
• Current autoCommit defaults from solrconfig.xml:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!--
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>
-->
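A soft commit can also be requested explicitly per request (a sketch using the standard update parameters):

# Make all pending updates visible without flushing/fsyncing segments to disk
$ curl 'http://localhost:8983/solr/update?softCommit=true'

# Or bound visibility per update: this add becomes searchable within ~1s
$ curl 'http://localhost:8983/solr/update?commitWithin=1000' \
    -H 'Content-type:application/json' -d '[{"id":"book1", "title":"American Gods"}]'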
Document Routing
numShards=4, router=compositeId

[Diagram: a hash ring divided among four shards covering the ranges
00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, and c0000000-ffffffff]

• Indexing: id = BigCo!doc5 is hashed with MurmurHash3; the "BigCo!" prefix
  determines the high bits of the hash (9f27...), which select the shard
• Querying: q=my_query&shard.keys=BigCo! hashes the key to 9f27 and searches
  only the shard covering 9f270000 to 9f27ffff (here, shard1)
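A sketch of the compositeId convention in practice (document fields invented):

# All documents with the "BigCo!" prefix hash to the same shard
$ curl http://localhost:8983/solr/collection1/update -H 'Content-type:application/json' -d '
[{"id":"BigCo!doc5", "title":"BigCo quarterly report"}]'

# Restrict the query to the shard(s) owning that key's hash range
$ curl 'http://localhost:8983/solr/collection1/query?q=my_query&shard.keys=BigCo!'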
Seamless Online Shard Splitting

[Diagram: an update stream reaching Shard1, Shard2, and Shard3 (each a leader
plus replica); Shard2 is being split into sub-shards Shard2_0 and Shard2_1]

1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
2. New sub-shards created in "construction" state
3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
4. Leader index is split and installed on the sub-shards
5. Sub-shards apply buffered updates, then become "active" leaders and the old shard becomes "inactive"
Questions?
Download