Elasticsearch

Mar 31, 2020

Elasticsearch is an open source analytics and full-text search engine.

You can build complex search functionality with Elasticsearch, such as auto-completion, typo correction, match highlighting, synonym handling, and relevance tuning.

You can also query structured data such as numbers and aggregate data, and use Elasticsearch as an analytics platform. You can write queries that aggregate data and use the results for making pie charts, line charts, or whatever you might need.
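As a rough sketch of what such an aggregation query can look like, the snippet below builds the JSON body for a terms aggregation. The field name `status` is made up for illustration, and the body would be sent to the `_search` endpoint of an index of your choosing.

```python
import json

def terms_aggregation_body(field, size=10):
    """Build a search body that groups documents by the values of `field`."""
    return {
        "size": 0,  # return only the aggregation buckets, no search hits
        "aggs": {
            f"by_{field}": {
                "terms": {"field": field, "size": size}
            }
        },
    }

# Hypothetical field name; each bucket could become one slice of a pie chart.
body = terms_aggregation_body("status")
print(json.dumps(body, indent=2))
```

Each bucket in the response carries a key and a document count, which is the shape a charting library typically needs.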

You can also use machine learning, for example to forecast sales based on historical data or to detect anomalies.

Cluster: a collection of related nodes that together contain all of our data.
Node: an instance of Elasticsearch that stores data. A node is an instance of Elasticsearch, not a machine, so you can run any number of nodes on the same machine.
Index: every document in Elasticsearch is stored within an index. An index is therefore a collection of documents that have similar characteristics and are logically related.
Sharding: a way to divide an index into separate pieces, where each piece is called a shard. A shard may be placed on any node, so if an index has five shards, for instance, we don't need to spread them across five different nodes. We could, but we don't have to. Each shard is actually a Lucene index. A shard can store just over two billion documents. The default number of shards for an index is one.
Replication: also configured at the index level. Replication works by creating copies of each of the shards that an index contains. These copies are referred to as replicas or replica shards. A shard that has been replicated one or more times is referred to as a primary shard.
Documents: data is stored as documents, each of which is a unit of information. A document in Elasticsearch corresponds to a row in a relational database.
The JSON object that we send to Elasticsearch is stored within a field named “_source.”
Fields: a document contains fields, which correspond to columns in a relational database.
Snapshots: a way to take backups so that you can restore data to a point in time. You can snapshot either specific indices or the entire cluster.
Mappings: define how documents and their fields should be stored and indexed. The point is to store and index data in a way that suits how we want to use it. For example, mappings can define which fields should be treated as full-text fields and which fields contain numbers, dates, or geographical locations. You can also specify date formats for date fields and analyzers for full-text fields. You can think of mappings as the equivalent of defining a schema for a table in a relational database such as MySQL.
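To make shards, replicas, and mappings a bit more concrete, here is a minimal sketch that builds the request for creating an index. It assumes the request format of Elasticsearch 7 or later; the index name `products` and all field names are invented for the example.

```python
import json

def create_index_request(index, shards=1, replicas=1):
    """Return (method, path, body) for creating an index."""
    body = {
        "settings": {
            "number_of_shards": shards,      # primary shards
            "number_of_replicas": replicas,  # copies of each primary shard
        },
        "mappings": {
            "properties": {
                # Hypothetical fields, one per kind of data mentioned above.
                "name": {"type": "text"},
                "price": {"type": "double"},
                "created_at": {"type": "date", "format": "yyyy-MM-dd"},
                "location": {"type": "geo_point"},
            }
        },
    }
    return "PUT", f"/{index}", body

method, path, body = create_index_request("products", shards=2, replicas=1)
print(method, path)
print(json.dumps(body, indent=2))
```

With two primary shards and one replica each, the cluster would hold four shards in total for this index.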

When searching for data, we specify the index whose documents we want to search, meaning that search queries are run against indices.
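A small sketch of that idea: the index name is part of the request path, and the query itself goes in the body. The index `products` and the field `name` are assumptions made for the example.

```python
import json

def match_search_request(index, field, text):
    """Return (method, path, body) for a full-text match query on one index."""
    body = {"query": {"match": {field: text}}}
    return "GET", f"/{index}/_search", body

method, path, body = match_search_request("products", "name", "coffee maker")
print(method, path)
print(json.dumps(body))
```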

Graph is all about the relationships in your data.
An example could be that when someone is viewing a product on an ecommerce website, we want to show related products on that page as well.

But to make this work, it is important to distinguish between popular and relevant. Suppose that a lot of people listen to Linkin Park and also enjoy listening to Mozart every now and then. That does not suggest that the two are related; the strong link between them is caused by the fact that they are both relatively popular. For example, if you go out on the street and ask ten people if they use Google, most of them will say yes. But that doesn’t mean that they have anything else in common; Google is simply popular with all kinds of people. On the other hand, if you ask ten people if they use Stack Overflow, the ones who say yes do have something in common, because Stack Overflow is specifically related to programming. So essentially what we are looking for is the uncommonly common, because that says something about relevance and not popularity. The point is that looking purely at the relationships in data, without looking at relevance, can be misleading. That’s why Graph uses the relevance capabilities of Elasticsearch when determining what is related and what isn’t.
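The "uncommonly common" idea can be illustrated with a little arithmetic. The sketch below is not how Graph scores relationships internally; it is a simplified ratio that compares how common something is within a sample to how common it is overall, using made-up listener counts.

```python
def lift(sample_hits, sample_size, overall_hits, overall_size):
    """How many times more common something is in the sample than overall."""
    # Same as (sample_hits / sample_size) / (overall_hits / overall_size),
    # rearranged so there is only one division.
    return (sample_hits * overall_size) / (sample_size * overall_hits)

# Made-up numbers: 1,000 Linkin Park listeners out of 100,000 users.
mozart = lift(300, 1_000, 30_000, 100_000)     # popular with everyone
niche_band = lift(200, 1_000, 2_000, 100_000)  # rare overall, common among fans

print(f"Mozart: {mozart:.1f}x, niche band: {niche_band:.1f}x")
```

In this toy data more Linkin Park fans listen to Mozart than to the niche band, yet Mozart is no more common among them than among everyone else, while the niche band is ten times more common. The second link is the one that says something about relevance.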

Inside the “bin” directory:
elasticsearch-plugin: used to install plugins.
elasticsearch-sql-cli: used to run Elasticsearch SQL queries.

Inside the “config” directory:
elasticsearch.yml is the main configuration file. It is best practice to define:
cluster.name: the cluster name
node.name: the node name
paths to the various directories for storing data and logs.

jvm.options

For a production environment, it is good practice to store data, logs, and configuration files outside of the Elasticsearch directory. The reason for that is that you can then remove the entire Elasticsearch directory without losing valuable data. This is useful when upgrading Elasticsearch, and also just as a precautionary measure.
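Put together, a minimal elasticsearch.yml following this advice might look like the fragment below. The cluster name, node name, and directory paths are placeholders chosen for illustration; pick values that suit your environment.

```yaml
# elasticsearch.yml (illustrative values only)
cluster.name: my-cluster
node.name: node-1

# Keep data and logs outside the Elasticsearch installation directory.
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
```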

Node Roles

Master: makes a node eligible to be elected as the cluster’s master node.
The master node is responsible for performing cluster-wide actions.
This mainly includes creating and deleting indices, keeping track of nodes, and allocating shards to nodes.
Data: enables a node to store a part of the cluster’s data. It also performs the queries related to that data, such as search queries and modifications. Storage of data therefore goes hand in hand with serving queries on the stored data.
Ingest: enables a node to run ingest pipelines. An ingest pipeline is a series of steps that are performed when ingesting a document into Elasticsearch. The steps are formally referred to as processors, and they may manipulate documents before they are added to an index, such as by adding and removing fields or changing values.
Coordination: a coordinating node manages how Elasticsearch distributes queries internally. Dedicated coordinating nodes are only really useful in large clusters, where they essentially act as load balancers.
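To give a feel for the ingest pipelines mentioned above, the following sketch builds the request that would define one. The pipeline ID and the field names are hypothetical; `set`, `remove`, and `lowercase` are three of the built-in processors.

```python
import json

def ingest_pipeline_request(pipeline_id):
    """Return (method, path, body) for defining an ingest pipeline."""
    body = {
        "description": "Example pipeline with hypothetical fields",
        "processors": [
            {"set": {"field": "ingested_by", "value": "example-pipeline"}},  # add a field
            {"remove": {"field": "temporary_note"}},                         # remove a field
            {"lowercase": {"field": "category"}},                            # change a value
        ],
    }
    return "PUT", f"/_ingest/pipeline/{pipeline_id}", body

method, path, body = ingest_pipeline_request("cleanup")
print(method, path)
print(json.dumps(body, indent=2))
```

Processors run in the order they are listed, so each step sees the document as the previous step left it.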

If you intend to make use of machine learning, there are two settings related to this feature. The first one, named “node.ml,” identifies a node as a machine learning node if set to “true.” This enables the node to run machine learning jobs. The second setting is named “xpack.ml.enabled,” which enables or disables the node’s capability of responding to machine learning API requests. If you make use of machine learning, then these two settings enable you to run dedicated machine learning nodes. That might be useful if you don’t want your background machine learning jobs to slow other things down, such as search requests.

Routing is the process of resolving the shard for a document, both when retrieving the document and when deciding which shard it should be stored on in the first place.
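In simplified form, routing boils down to hashing a routing value (the document ID by default) and taking the remainder after dividing by the number of primary shards. The sketch below illustrates that idea only; it uses CRC32 as a stand-in hash, which is an assumption for the example and not the hash function Elasticsearch uses internally.

```python
import zlib

def resolve_shard(routing_value, num_primary_shards):
    """Map a routing value to a shard number in [0, num_primary_shards)."""
    # Stand-in hash chosen because it is deterministic across runs;
    # Elasticsearch applies its own hash function here.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % num_primary_shards

# Hypothetical document ID and a five-shard index.
shard = resolve_shard("doc-1", 5)
print(f"doc-1 lives on shard {shard}")
```

Because the result depends on the number of primary shards, the same ID would generally resolve to a different shard if that number changed, so the formula only works as long as the shard count stays fixed.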

Elasticsearch documents are immutable, meaning they cannot be changed. The Update API therefore replaces the document. Specifically, it retrieves the document, changes its fields according to our specification, and reindexes the document with the same ID, effectively replacing it. The existing document is not updated with new fields or values; it is replaced entirely.
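The retrieve, change, and reindex steps can be mimicked with a toy in-memory model. This is only an illustration of the replace-on-update behaviour, not of how Elasticsearch stores data; the dictionary standing in for an index and the sample document are invented.

```python
def update_document(index, doc_id, partial):
    """Simulate an update: read, merge into a new copy, store under the same ID."""
    current = index[doc_id]                           # 1. retrieve the document
    replacement = {**current["_source"], **partial}   # 2. apply changes to a copy
    index[doc_id] = {                                 # 3. reindex under the same ID
        "_version": current["_version"] + 1,
        "_source": replacement,
    }
    return index[doc_id]

# A toy "index" keyed by document ID.
index = {"1": {"_version": 1, "_source": {"name": "Toaster", "price": 30}}}
before = index["1"]
after = update_document(index, "1", {"price": 25})

print(before["_source"], after["_source"])
```

The old object in `before` still holds the original price; the index simply points at a new document with a higher version.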

Upserting means to conditionally update or insert a document based on whether or not the document already exists.
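A sketch of what an upsert request can look like with the Update API: the script runs if the document exists, and the `upsert` document is inserted if it does not. The index name, document ID, and the `in_stock` field are assumptions made for the example.

```python
import json

def upsert_request(index, doc_id, increment, initial_doc):
    """Return (method, path, body) for a scripted update with an upsert fallback."""
    body = {
        "script": {
            "lang": "painless",
            # Applied when the document already exists.
            "source": "ctx._source.in_stock += params.count",
            "params": {"count": increment},
        },
        # Indexed as-is when the document does not exist yet.
        "upsert": initial_doc,
    }
    return "POST", f"/{index}/_update/{doc_id}", body

method, path, body = upsert_request(
    "products", "1", increment=1, initial_doc={"name": "Toaster", "in_stock": 1}
)
print(method, path)
print(json.dumps(body, indent=2))
```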

By default, each text field is mapped using both the “text” type and the “keyword” type. The difference between the two is that the “text” type is used for full-text searches, while the “keyword” type is used for exact matches, aggregations, and the like.
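The snippet below sketches the mapping that dynamic mapping produces for a string value, along with one query for each half of it. The field name passed to the helpers is hypothetical.

```python
import json

# Shape of the default dynamic mapping for a string field: a "text" field
# with a "keyword" sub-field for exact values.
dynamic_string_mapping = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}

def full_text_query(field, text):
    """Analyzed full-text search against the "text" mapping."""
    return {"query": {"match": {field: text}}}

def exact_match_query(field, value):
    """Exact, unanalyzed match against the "keyword" sub-field."""
    return {"query": {"term": {f"{field}.keyword": value}}}

print(json.dumps(full_text_query("name", "coffee maker")))
print(json.dumps(exact_match_query("name", "Coffee Maker 3000")))
```

The `.keyword` suffix is how the sub-field is addressed in queries and aggregations.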

APIs

GET /_cluster/health
GET /_cat/indices?v
GET /_cat/shards?v
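If you prefer to call these endpoints from a script rather than a console, a minimal sketch using only the Python standard library could look like this. It assumes a node listening on `http://localhost:9200` without security enabled; adjust the base URL for your own setup.

```python
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node without security

def endpoint(path, base_url=BASE_URL):
    """Join the base URL and an API path such as "/_cluster/health"."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")

def get(path, base_url=BASE_URL):
    """Send a GET request and return the response body as text."""
    with urllib.request.urlopen(endpoint(path, base_url)) as response:
        return response.read().decode("utf-8")

print(endpoint("/_cat/indices?v"))
```

With a node running, `get("/_cluster/health")` would return the health document as JSON text, and `get("/_cat/shards?v")` would return the shard table as plain text.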

Written by Himanshu Lohiya